AMD MI50 32GB/Vega20 GPU Passthrough Guide for Proxmox

Posted by Panda24z@reddit | LocalLLaMA

What This Guide Solves

If you're trying to pass through an AMD Vega20 GPU (like the MI50 or Radeon Pro VII) to a VM in Proxmox and getting stuck with the dreaded "atombios stuck in loop" error, this guide is for you. The solution involves installing the vendor-reset kernel module on your Proxmox host.
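For reference, the failure typically looks something like this in the VM's dmesg (paraphrased from the kernel's error string; the exact log prefix varies by kernel version):

amdgpu: atombios stuck in loop for more than 5secs aborting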

Important note: This solution was developed after trying the standard PCIe passthrough setup first, which failed. While I'm not entirely sure if all the standard passthrough steps are required when using vendor-reset, I'm including them since they were part of my working configuration.

Warning: This involves kernel module compilation and hardware-level GPU reset procedures. Test this at your own risk.

My Setup

Here's what I was working with:

Part 1: Standard PCIe Passthrough Setup

Heads up: These steps might not all be necessary with vendor-reset, but I did them first and they're part of my working setup.

Helpful video reference: Proxmox PCIe Passthrough Guide

Enable IOMMU Support

If you're using a legacy boot system:

nano /etc/default/grub

Add this line:

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
# Or for AMD systems:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on"

Then save and run:

update-grub

If you're using EFI boot:

nano /etc/kernel/cmdline

Append this to the existing command line:

intel_iommu=on
# Or for AMD systems:
amd_iommu=on

Then save and run:

proxmox-boot-tool refresh
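Either way, after the next reboot it's worth confirming IOMMU actually came up (a quick sanity check I'd suggest; it wasn't part of my original steps):

dmesg | grep -e DMAR -e IOMMU
# Intel boards typically log "DMAR: IOMMU enabled",
# AMD boards "AMD-Vi: Interrupt remapping enabled"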

Load VFIO Modules

Edit the modules file:

nano /etc/modules

Add these lines:

vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

Note: on kernels 6.2 and newer (including Proxmox's 6.8), vfio_virqfd has been merged into the core vfio module, so this entry is optional and a "module not found" message for it at boot is harmless.

Find Your GPU and Current Driver

First, let's see what we're working with:

# Find your AMD GPU
lspci | grep -i amd

# Get detailed info (replace 08:00 with your actual PCI address)
lspci -n -s 08:00 -v

Here's what I saw on my system:

08:00.0 0300: 1002:66a3 (prog-if 00 [VGA controller])
        Subsystem: 106b:0201
        Flags: bus master, fast devsel, latency 0, IRQ 44, NUMA node 0, IOMMU group 111
        Memory at b0000000 (64-bit, prefetchable) [size=256M]
        Memory at c0000000 (64-bit, prefetchable) [size=2M]
        I/O ports at 3000 [size=256]
        Memory at c7100000 (32-bit, non-prefetchable) [size=512K]
        Expansion ROM at c7180000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [200] Physical Resizable BAR
        Capabilities: [270] Secondary PCI Express
        Capabilities: [2a0] Access Control Services
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Capabilities: [320] Latency Tolerance Reporting
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu

Notice it shows "Kernel modules: amdgpu" - that's what we need to blacklist.

Configure VFIO and Blacklist the AMD Driver

echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf
echo "options kvm ignore_msrs=1" > /etc/modprobe.d/kvm.conf

# Blacklist the AMD GPU driver
echo "blacklist amdgpu" >> /etc/modprobe.d/blacklist.conf

Bind Your GPU to VFIO

# Use the vendor:device ID from your lspci output (mine was 1002:66a3)
echo "options vfio-pci ids=1002:66a3 disable_vga=1" > /etc/modprobe.d/vfio.conf

Apply Changes and Reboot

update-initramfs -u -k all
reboot

Check That VFIO Binding Worked

After the reboot, verify your GPU is now using the vfio-pci driver:

# Use your actual PCI address
lspci -n -s 08:00 -v

You should see:

Kernel driver in use: vfio-pci
Kernel modules: amdgpu

If you see Kernel driver in use: vfio-pci, the standard passthrough setup is working correctly.

Part 2: The vendor-reset Solution

This is where the magic happens for AMD Vega20 GPUs.

Check Your System is Ready

Make sure your Proxmox host has the required kernel features:

# Check your kernel version
uname -r

# Verify required features (all should show 'y')
grep -E "CONFIG_FTRACE=|CONFIG_KPROBES=|CONFIG_PCI_QUIRKS=|CONFIG_KALLSYMS=|CONFIG_KALLSYMS_ALL=|CONFIG_FUNCTION_TRACER=" /boot/config-$(uname -r)

# Find your GPU info again
lspci -nn | grep -i amd

You should see something like:

6.8.12-13-pve

CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KPROBES=y
CONFIG_PCI_QUIRKS=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y

08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro Vega II/Radeon Pro Vega II Duo] [1002:66a3]

Make note of your GPU's PCI address (mine is 08:00.0) - you'll need this later.

Install Build Dependencies

# Update and install what we need
apt update
apt install -y git dkms build-essential

# Install Proxmox kernel headers
apt install -y pve-headers-$(uname -r)

# Double-check the headers are there
ls -la /lib/modules/$(uname -r)/build

You should see a symlink pointing to something like /usr/src/linux-headers-X.X.X-X-pve.

Build and Install vendor-reset

# Download the source
cd /tmp
git clone https://github.com/gnif/vendor-reset.git
cd vendor-reset

# Clean up any previous attempts
sudo dkms remove vendor-reset/0.1.1 --all 2>/dev/null || true
sudo rm -rf /usr/src/vendor-reset-0.1.1
sudo rm -rf /var/lib/dkms/vendor-reset

# Build and install the module
sudo dkms install .

If everything goes well, you'll see output like:

Sign command: /lib/modules/6.8.12-13-pve/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub
Creating symlink /var/lib/dkms/vendor-reset/0.1.1/source -> /usr/src/vendor-reset-0.1.1
Building module:
Cleaning build area...
make -j56 KERNELRELEASE=6.8.12-13-pve KDIR=/lib/modules/6.8.12-13-pve/build...
Signing module /var/lib/dkms/vendor-reset/0.1.1/build/vendor-reset.ko
Cleaning build area...
vendor-reset.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/6.8.12-13-pve/updates/dkms/
depmod...
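You can also confirm DKMS registered the module (the exact output format varies a bit between dkms versions):

dkms status | grep vendor-reset
# expect something like: vendor-reset/0.1.1, 6.8.12-13-pve, x86_64: installed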

Configure vendor-reset to Load at Boot

# Tell the system to load vendor-reset at boot
echo "vendor-reset" | sudo tee -a /etc/modules

# Copy the udev rules that automatically set the reset method
sudo cp udev/99-vendor-reset.rules /etc/udev/rules.d/

# Update initramfs
sudo update-initramfs -u -k all

# Make sure the module file is where it should be
ls -la /lib/modules/$(uname -r)/updates/dkms/vendor-reset.ko

Reboot and Verify Everything Works

reboot

After the reboot, check that everything is working:

# Make sure vendor-reset is loaded
lsmod | grep vendor_reset

# Check the reset method for your GPU (use your actual PCI address)
cat /sys/bus/pci/devices/0000:08:00.0/reset_method

# Confirm your GPU is still detected
lspci -nn | grep -i amd

What you want to see:

vendor_reset            16384  0

device_specific

08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro Vega II/Radeon Pro Vega II Duo] [1002:66a3]

The reset method MUST be device_specific. If it shows bus, the udev rules didn't take effect.
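If it does show bus, you can force it by hand for the current boot while debugging the udev rule (a workaround sketch, not something I needed; use your actual PCI address, and note this does not persist across reboots):

echo device_specific > /sys/bus/pci/devices/0000:08:00.0/reset_method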

Part 3: VM Configuration

Add the GPU to Your VM

Through the Proxmox web interface:

  1. Go to your VM → Hardware → Add → PCI Device
  2. Select your GPU (like 0000:08:00)
  3. Check "All Functions"
  4. Apply the changes
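If you prefer the CLI, this should be equivalent (a sketch using my VM ID, 106; omitting the function suffix passes through all functions, which is what the "All Functions" checkbox does):

qm set 106 --hostpci0 0000:08:00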

Machine Type: I used q35 for my VM; I did not try the other options.

Handle Large VRAM

Since GPUs like the MI50 have tons of VRAM (32GB), you need to increase the PCI BAR size.
Edit your VM config file (/etc/pve/qemu-server/VMID.conf) and add this line:

args: -cpu host,host-phys-bits=on -fw_cfg opt/ovmf/X-PciMmio64Mb,string=65536

The value is in megabytes, so 65536 maps a 64GB MMIO window, double the card's 32GB of VRAM. I opted for this larger size based on a recommendation from another reddit post.
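If you'd rather not edit the file by hand, the same args: line can be set with qm (a sketch; replace VMID with your VM's ID):

qm set VMID --args "-cpu host,host-phys-bits=on -fw_cfg opt/ovmf/X-PciMmio64Mb,string=65536"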

Here's my complete working VM configuration for reference:

args: -cpu host,host-phys-bits=on -fw_cfg opt/ovmf/X-PciMmio64Mb,string=65536
bios: seabios
boot: order=scsi0;hostpci0;net0
cores: 8
cpu: host
hostpci0: 0000:08:00
machine: q35
memory: 32768
name: AI-Node
net0: virtio=XX:XX:XX:XX:XX:XX,bridge=vmbr0,tag=40
numa: 1
ostype: l26
scsi0: local-lvm:vm-106-disk-0,cache=writeback,iothread=1,size=300G,ssd=1
scsihw: virtio-scsi-single
sockets: 2

Key points:

  - The args line is what raises the 64-bit MMIO window so the 32GB of VRAM can be mapped.
  - hostpci0: 0000:08:00 has no function suffix, so all functions of the card are passed through.
  - machine: q35 matches the machine type I tested above.

Test Your VM

Start up your VM and check if the GPU initialized properly:

# Inside the Ubuntu VM, check the logs
sudo dmesg | grep -Ei "bios|gpu|amd|drm"

If everything worked, you should see something like:

[   28.319860] [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x02).
[   28.354277] amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from ROM BAR
[   28.354283] amdgpu: ATOM BIOS: 113-D1631700-111
[   28.361352] amdgpu 0000:05:00.0: amdgpu: MEM ECC is active.
[   28.361354] amdgpu 0000:05:00.0: amdgpu: SRAM ECC is active.
[   29.376346] [drm] Initialized amdgpu 3.57.0 20150101 for 0000:05:00.0 on minor 0

Part 4: Getting ROCm Working

After I got Ubuntu 22.04.5 running in the VM, I followed AMD's standard ROCm installation guide to get everything working for Ollama.

Reference: ROCm Quick Start Installation Guide

Install ROCm

# Download and install the amdgpu-install package
wget https://repo.radeon.com/amdgpu-install/6.4.3/ubuntu/jammy/amdgpu-install_6.4.60403-1_all.deb
sudo apt install ./amdgpu-install_6.4.60403-1_all.deb
sudo apt update

# Install some required Python packages
sudo apt install python3-setuptools python3-wheel

# Add your user to the right groups
sudo usermod -a -G render,video $LOGNAME

# Install ROCm
sudo apt install rocm

Install AMDGPU Kernel Module

# If you haven't already downloaded the installer
wget https://repo.radeon.com/amdgpu-install/6.4.3/ubuntu/jammy/amdgpu-install_6.4.60403-1_all.deb
sudo apt install ./amdgpu-install_6.4.60403-1_all.deb
sudo apt update

# Install kernel headers and the AMDGPU driver
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo apt install amdgpu-dkms

Post-Installation Setup

Following the ROCm Post-Install Guide:

# Set up library paths
sudo tee --append /etc/ld.so.conf.d/rocm.conf <<EOF
/opt/rocm/lib
/opt/rocm/lib64
EOF
sudo ldconfig

# Check ROCm installation
sudo update-alternatives --display rocm

# Set up environment variable
export LD_LIBRARY_PATH=/opt/rocm-6.4.3/lib

Reboot the VM after installing ROCm and the AMDGPU drivers.
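After that reboot, a quick way to confirm ROCm sees the card (standard ROCm tools; I didn't record my exact output):

# The MI50/Vega20 should report as gfx906
rocminfo | grep -i gfx

# Show VRAM usage, temperature, and utilization
rocm-smi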

Need to Remove Everything?

If you want to completely remove vendor-reset:

# Remove the DKMS module
sudo dkms remove vendor-reset/0.1.1 --all
sudo rm -rf /usr/src/vendor-reset-0.1.1
sudo rm -rf /var/lib/dkms/vendor-reset

# Remove configuration files
sudo sed -i '/vendor-reset/d' /etc/modules
sudo rm -f /etc/udev/rules.d/99-vendor-reset.rules

# Update initramfs and reboot
sudo update-initramfs -u -k all
reboot

Credits and References

  - gnif's vendor-reset project: https://github.com/gnif/vendor-reset
  - The Proxmox PCIe Passthrough video guide linked in Part 1
  - AMD's ROCm Quick Start Installation Guide and Post-Install Guide

Final Thoughts

This setup took me way longer to figure out than it should have. If this guide saves you some time and frustration, awesome! Feel free to contribute back with any improvements or issues you run into.