The rapid adoption of AI/ML workloads has driven explosive demand for GPU computing. Proxmox VE's GPU passthrough capability lets virtual machines access physical GPUs directly, delivering near-bare-metal performance for LLM inference, model training, and image generation. This article provides a detailed walkthrough from IOMMU/VFIO configuration to running AI workloads.
For running AI/ML infrastructure on Kubernetes, Kubo On-Premise provides a fully managed K8s environment including GPU node management, with automated GPU resource scheduling and operations.
GPU Passthrough Fundamentals
GPU passthrough is a virtualization technique that gives VMs direct access to physical GPUs through VFIO (Virtual Function I/O), minimizing hypervisor overhead.
Understanding IOMMU
IOMMU (Input-Output Memory Management Unit) is the hardware component that translates the addresses devices use for DMA and isolates devices from one another in IOMMU groups. It is a prerequisite for passthrough.
- Intel: VT-d (Virtualization Technology for Directed I/O)
- AMD: AMD-Vi (AMD I/O Virtualization Technology)
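Which of the two applies depends on the CPU vendor. The sketch below reads `vendor_id` from `/proc/cpuinfo`; `iommu_family` is a hypothetical helper name, and it only identifies the vendor. It does not prove that the IOMMU is enabled in firmware.

```bash
# Hypothetical helper: map the CPU vendor to its IOMMU implementation.
# Reads a cpuinfo-format file (default: /proc/cpuinfo); firmware settings are NOT checked.
iommu_family() {
  vendor=$(awk -F': *' '/^vendor_id/ { print $2; exit }' "${1:-/proc/cpuinfo}")
  case "$vendor" in
    GenuineIntel) echo "intel (VT-d)" ;;
    AuthenticAMD) echo "amd (AMD-Vi)" ;;
    *) echo "unknown" ;;
  esac
}

# Usage on the host: iommu_family
```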
Supported GPUs
| Category | GPU | VRAM | Recommended Use |
|---|---|---|---|
| Entry | RTX 3060 / RTX 4060 | 8-12 GB | Small-scale LLM inference |
| Mid-range | RTX 3080 / RTX 4060 Ti | 10-16 GB | Medium models, fine-tuning |
| High-end | RTX 3090 / RTX 4090 | 24 GB | Large LLMs, multi-model |
| Datacenter | NVIDIA P40 / A100 | 24-80 GB | Production AI workloads |
The Proxmox forum GPU passthrough guide notes that NVIDIA GPUs have the broadest support across AI stacks. AMD GPUs require ROCm but work with Ollama on supported distributions.
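To relate the VRAM column to model sizes, a common rule of thumb is that weights alone take roughly parameters x bits per weight / 8 bytes. The helpers below (hypothetical names) apply that rule; they ignore KV cache, activations and runtime overhead, so real usage is higher.

```bash
# Rough size of model weights in GB: billions of parameters x bits per weight / 8.
# KV cache, activations and framework overhead are NOT included.
estimate_weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Exit status 0 if the weights alone fit in the given VRAM (GB).
weights_fit() {
  awk -v p="$1" -v b="$2" -v v="$3" 'BEGIN { exit !(p * b / 8 <= v) }'
}

# Usage:
#   estimate_weights_gb 8 16    # 8B model at 16-bit
#   weights_fit 70 4 24 || echo "70B at 4-bit does not fit in 24 GB"
```

With this estimate an 8B model at 16-bit needs about 16 GB for weights, while a 70B model at 4-bit needs about 35 GB, which is why 70B-class models outgrow a single 24 GB card.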
BIOS and Proxmox Host Configuration
BIOS Settings
Verify the following BIOS settings:
- Disable Secure Boot: Avoids having to enroll signing keys (MOK) for unsigned kernel modules
- Enable VT-d (Intel) / IOMMU (AMD): Activate in the motherboard BIOS menus
- Update firmware to latest version: Improves passthrough compatibility
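After changing firmware settings, the Secure Boot state can be confirmed from Linux with `mokutil --sb-state` (from the `mokutil` package), which prints `SecureBoot enabled` or `SecureBoot disabled`. The sketch below classifies that output; `sb_state` is a hypothetical helper.

```bash
# Hypothetical helper: classify the text printed by `mokutil --sb-state`.
sb_state() {
  case "$1" in
    *"SecureBoot enabled"*) echo enabled ;;
    *"SecureBoot disabled"*) echo disabled ;;
    *) echo unknown ;;   # e.g. legacy BIOS boot or EFI variables unavailable
  esac
}

# Usage on the host:
#   sb_state "$(mokutil --sb-state 2>&1)"
```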
Kernel Parameters
Add IOMMU parameters to the GRUB command line:
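```bash
# Edit /etc/default/grub
# For Intel CPUs
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
# For AMD CPUs (AMD-Vi is enabled by default, so amd_iommu=on is not needed)
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"
# Update GRUB and reboot
update-grub
reboot
# Hosts booted with systemd-boot (e.g. ZFS root on UEFI) use /etc/kernel/cmdline
# instead; apply changes there with: proxmox-boot-tool refresh
```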
The iommu=pt (pass-through mode) option identity-maps devices that remain with the host, so their DMA skips IOMMU translation overhead; devices assigned to VMs are still fully translated and isolated.
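Editing `/etc/default/grub` by hand is easy to get wrong when repeated. A minimal sketch, assuming the file has a single `GRUB_CMDLINE_LINUX_DEFAULT="..."` line: the hypothetical `add_grub_param` appends a parameter only if it is not already present, so running it twice is harmless. Run `update-grub` afterwards as shown above.

```bash
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT
# in the given file, unless it is already there. Other lines are left untouched.
add_grub_param() {
  agp_file="$1" agp_param="$2"
  agp_tmp=$(mktemp) || return 1
  awk -v p="$agp_param" '
    /^GRUB_CMDLINE_LINUX_DEFAULT="/ {
      val = $0
      sub(/^GRUB_CMDLINE_LINUX_DEFAULT="/, "", val)
      sub(/"[ \t]*$/, "", val)
      n = split(val, words, " ")
      for (i = 1; i <= n; i++) if (words[i] == p) { print; next }   # already present
      print "GRUB_CMDLINE_LINUX_DEFAULT=\"" (val == "" ? p : val " " p) "\""
      next
    }
    { print }
  ' "$agp_file" > "$agp_tmp" && cat "$agp_tmp" > "$agp_file"   # cat keeps file permissions
  agp_rc=$?
  rm -f "$agp_tmp"
  return "$agp_rc"
}

# Usage: add_grub_param /etc/default/grub iommu=pt && update-grub
```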
After reboot, verify IOMMU is enabled:
```bash
dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
# Intel: expect "DMAR: IOMMU enabled" or similar; AMD: expect AMD-Vi initialization lines
```
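It is also worth confirming that the running kernel actually booted with the new parameters, since a missed `update-grub` is a common cause of failure. A sketch with a hypothetical helper that checks whole words of `/proc/cmdline` (or a command line passed as the second argument):

```bash
# Hypothetical helper: succeed if an exact kernel parameter is on the command line.
# $1: parameter (e.g. iommu=pt); $2: optional command line (default: /proc/cmdline)
has_kernel_param() {
  for hkp_word in ${2:-$(cat /proc/cmdline)}; do
    [ "$hkp_word" = "$1" ] && return 0
  done
  return 1
}

# Usage: has_kernel_param iommu=pt && echo "iommu=pt is active"
```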
VFIO Module Configuration
Load VFIO modules and blacklist host GPU drivers:
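```bash
# Add VFIO modules to /etc/modules
# (vfio_virqfd was merged into vfio in kernel 6.2; add it only on older kernels)
cat >> /etc/modules <<EOF
vfio
vfio_iommu_type1
vfio_pci
EOF
# Blacklist host GPU drivers
# snd_hda_intel also drives onboard audio: blacklist it only if the host needs no sound;
# otherwise rely on the softdep in /etc/modprobe.d/vfio.conf
cat > /etc/modprobe.d/blacklist-gpu.conf <<EOF
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
blacklist snd_hda_intel
EOF
```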
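Once the host has rebooted, `lsmod` shows whether the VFIO modules loaded. A sketch with a hypothetical helper that prints each required module missing from `lsmod`-style text; note that modules compiled into the kernel never appear in `lsmod`, so an entry reported here is a prompt to check, not proof of failure.

```bash
# Hypothetical helper: print each named module absent from lsmod-style text.
# $1: output of lsmod; remaining arguments: module names to look for
missing_modules() {
  mm_text="$1"; shift
  for mm_mod in "$@"; do
    printf '%s\n' "$mm_text" |
      awk -v m="$mm_mod" '$1 == m { found = 1 } END { exit !found }' || echo "$mm_mod"
  done
}

# Usage: missing_modules "$(lsmod)" vfio vfio_iommu_type1 vfio_pci
```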
Binding the GPU to VFIO
Identify the GPU's PCI address and vendor IDs, then bind to the VFIO driver:
```bash
# Find GPU PCI address
lspci -v | grep -i nvidia
# Example: 01:00.0 VGA compatible controller: NVIDIA Corporation ...
# Example: 01:00.1 Audio device: NVIDIA Corporation ...
# Get vendor:device IDs
lspci -n -s 01:00
# Example: 01:00.0 0300: 10de:2684 (rev a1)
# Example: 01:00.1 0403: 10de:22ba (rev a1)
# Bind GPU to VFIO (replace the IDs with the ones reported on your system)
cat > /etc/modprobe.d/vfio.conf <<EOF
options vfio-pci ids=10de:2684,10de:22ba disable_vga=1
softdep nvidia pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci
EOF
# Update initramfs and reboot
update-initramfs -u -k all
reboot
```
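Copying the IDs by hand is error-prone. A minimal sketch, assuming the `lspci -n` format shown above (third field is `vendor:device`): the hypothetical `vfio_ids` joins the IDs of every function into the comma-separated list that `options vfio-pci ids=` expects. Keep in mind that `ids=` matches by vendor:device, so two identical GPUs would both be claimed by vfio-pci.

```bash
# Hypothetical helper: turn `lspci -n -s <slot>` output into a vfio-pci ids= list.
vfio_ids() {
  awk '{ ids = ids (NR > 1 ? "," : "") $3 } END { print ids }'
}

# Usage:
#   echo "options vfio-pci ids=$(lspci -n -s 01:00 | vfio_ids) disable_vga=1"
```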
Verify the binding after reboot:
```bash
lspci -nnk -s 01:00
# Should show "Kernel driver in use: vfio-pci"
```
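Each function of the card (VGA and audio) must be bound to vfio-pci. A sketch that scans `lspci -nnk` output and prints every function whose driver is anything else, or which has no driver at all; it assumes device lines start at column 0 and detail lines are indented, as `lspci` prints them. The helper name is hypothetical.

```bash
# Hypothetical helper: read `lspci -nnk` output on stdin and print the address of
# every function whose "Kernel driver in use" is not vfio-pci (or is missing).
non_vfio_functions() {
  awk '
    /^[0-9a-f]/ { if (dev != "" && drv != "vfio-pci") print dev; dev = $1; drv = "" }
    /Kernel driver in use:/ { drv = $NF }
    END { if (dev != "" && drv != "vfio-pci") print dev }
  '
}

# Usage: lspci -nnk -s 01:00 | non_vfio_functions   # no output means all bound
```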
With Kubo, integrating GPU nodes into Kubernetes and automating GPU resource scheduling through the NVIDIA Device Plugin becomes straightforward.
Creating the VM and Attaching the GPU
VM Configuration
Create a VM optimized for GPU passthrough:
```bash
# Create VM via Web UI or CLI
qm create 200 --name ai-workstation --memory 32768 --cores 8 \
--machine q35 --bios ovmf \
--cpu host \
--scsihw virtio-scsi-single \
--net0 virtio,bridge=vmbr0
# Add EFI disk
qm set 200 --efidisk0 local-lvm:1,efitype=4m
# Add OS disk
qm set 200 --scsi0 local-lvm:100,ssd=1,discard=on
```
Critical settings:
| Setting | Value | Reason |
|---|---|---|
| Machine type | q35 | Required for PCIe passthrough |
| BIOS | OVMF (UEFI) | Proper PCIe device initialization |
| CPU type | host | Full CPU feature exposure |
| SCSI Controller | VirtIO SCSI | Maximum disk performance |
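The settings in this table can be checked against an existing VM's configuration. A sketch, assuming the Proxmox config format (`key: value` lines, snapshot sections starting with `[`), that prints each recommended setting it does not find; `check_vm_conf` is a hypothetical helper, and versioned machine types such as `pc-q35-8.1` are accepted as q35.

```bash
# Hypothetical helper: read a Proxmox VM config on stdin and print each
# recommended setting from the table above that is missing.
check_vm_conf() {
  awk '
    /^\[/ { exit }                     # stop at the first snapshot section
    /^machine:.*q35/ { q35 = 1 }
    /^bios: ovmf/ { ovmf = 1 }
    /^cpu: host/ { host = 1 }
    END {
      if (!q35) print "machine: q35"
      if (!ovmf) print "bios: ovmf"
      if (!host) print "cpu: host"
    }'
}

# Usage: qm config 200 | check_vm_conf   # no output means all three are set
```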
GPU Passthrough Configuration
Add the GPU device to the VM configuration:
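```bash
# Add to /etc/pve/qemu-server/200.conf
# Pass through all functions of the GPU (VGA + Audio)
hostpci0: 01:00,pcie=1,x-vga=on
# x-vga=on makes the GPU the VM's primary display and the emulated VGA is ignored;
# it can usually be omitted for headless compute VMs
# Equivalent CLI: qm set 200 --hostpci0 01:00,pcie=1,x-vga=on
# Or via Web UI:
# Hardware → Add → PCI Device → Select GPU
# Check "All Functions" and "PCI-Express"
```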
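Because `pcie=1` is what makes the device appear on the q35 PCIe bus, a quick check for `hostpciN` entries without it can save a debugging session. A sketch with a hypothetical helper; it assumes the flag is written as a Proxmox boolean (`1`, `on`, `yes` or `true`), as the web UI and `qm set` do.

```bash
# Hypothetical helper: read a VM config on stdin and print every hostpciN
# entry that is not attached as PCI Express.
hostpci_without_pcie() {
  awk -F': ' '/^hostpci[0-9]+: / && $2 !~ /(^|,)pcie=(1|on|yes|true)(,|$)/ { print $1 }'
}

# Usage: qm config 200 | hostpci_without_pcie
```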
Installing NVIDIA Drivers in the Guest
Install NVIDIA drivers and CUDA toolkit inside the VM:
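```bash
# For Ubuntu 24.04
# Install NVIDIA driver (list the branches available to you with: ubuntu-drivers devices)
sudo apt update
sudo apt install -y nvidia-driver-560
# Install CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-6
# nvcc is installed under /usr/local/cuda/bin; add it to PATH if needed
# Reboot so the NVIDIA kernel module is loaded, then verify GPU operation
sudo reboot
nvidia-smi
```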
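For scripts, `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits` gives machine-readable output (one `name, MiB` line per GPU). A sketch with a hypothetical helper that sums the memory column, for example to check how much VRAM the VM really sees:

```bash
# Hypothetical helper: sum memory.total (MiB) from
#   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
total_vram_mib() {
  awk -F', *' '{ sum += $NF } END { print sum + 0 }'
}

# Usage:
#   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits | total_vram_mib
```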
Running AI/ML Workloads
LLM Inference with Ollama
Ollama is the easiest way to run LLMs in a GPU passthrough environment:
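```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Verify service status
systemctl status ollama
# Download and run a model
# (the default 4-bit llama3.1:70b needs roughly 40 GB; on a 24 GB card part of it
#  runs on the CPU, so llama3.1:8b is the better fit there)
ollama pull llama3.1:70b
ollama run llama3.1:70b
# API access (default: port 11434); "stream": false returns one JSON response
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:70b",
"prompt": "Explain GPU passthrough in Proxmox",
"stream": false
}'
```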
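Hand-written JSON breaks as soon as a prompt contains a double quote. A minimal sketch of a payload builder for `/api/generate`; `json_escape` and `ollama_payload` are hypothetical helpers that escape only backslashes and double quotes (not newlines or control characters), so a real JSON tool such as `jq` is the safer choice for arbitrary input.

```bash
# Hypothetical helpers: escape backslashes and double quotes for a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a non-streaming /api/generate request body: $1 = model, $2 = prompt.
ollama_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Usage:
#   curl -s http://localhost:11434/api/generate \
#     -d "$(ollama_payload llama3.1:8b 'Explain "iommu=pt" briefly')"
```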
High-Performance Inference with vLLM
For production environments requiring high throughput, vLLM is a strong choice thanks to PagedAttention and continuous batching:
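```bash
# Install vLLM
pip install vllm
# meta-llama models are gated: accept the license on Hugging Face and export HF_TOKEN
# Start OpenAI-compatible API server
# (70B weights need about 140 GB at BF16, more than one GPU; the 8B model fits 24 GB)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--port 8000
```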
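With `--gpu-memory-utilization 0.9`, vLLM uses up to 90% of VRAM and spends what remains after the weights on the KV cache. Per token that cache takes 2 (K and V) x layers x KV heads x head dimension x bytes per element. The sketch below (hypothetical helper names) evaluates the formula; as an example it assumes Llama 3.1 8B's published configuration of 32 layers, 8 KV heads and head dimension 128 at BF16 (2 bytes). If the full context does not fit, vLLM's `--max-model-len` option caps it.

```bash
# Per-token KV cache size in bytes:
#   2 (K and V) x layers x KV heads x head dim x bytes per element
kv_bytes_per_token() {
  echo $(( 2 * $1 * $2 * $3 * $4 ))
}

# KV cache for a number of tokens, in MiB (integer division).
# Args: layers kv_heads head_dim bytes_per_elem tokens
kv_cache_mib() {
  echo $(( $(kv_bytes_per_token "$1" "$2" "$3" "$4") * $5 / 1048576 ))
}

# Usage (Llama 3.1 8B values assumed above):
#   kv_bytes_per_token 32 8 128 2      # bytes per token
#   kv_cache_mib 32 8 128 2 8192       # one 8K-token sequence
```

Under these assumptions one token costs 128 KiB, so an 8,192-token sequence needs about 1 GiB of KV cache, while the model's full 131,072-token context would need about 16 GiB, more than a 24 GB card has left after the weights.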
Docker + NVIDIA Container Toolkit
For container-based AI workloads, use the NVIDIA Container Toolkit:
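```bash
# Install NVIDIA Container Toolkit (Docker must already be installed)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Add repository and install
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker and restart it
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Run GPU-accelerated container
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi
```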
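The toolkit's documented setup runs `nvidia-ctk runtime configure --runtime=docker`, which registers an `nvidia` entry under `runtimes` in `/etc/docker/daemon.json`. A rough check, assuming that file layout; `has_nvidia_runtime` is a hypothetical helper and uses `grep`, not a JSON parser.

```bash
# Hypothetical helper: succeed if a Docker daemon.json (default path below)
# appears to register the "nvidia" runtime. Text search only, not JSON parsing.
has_nvidia_runtime() {
  hnr_file="${1:-/etc/docker/daemon.json}"
  [ -r "$hnr_file" ] && grep -q '"runtimes"' "$hnr_file" && grep -q '"nvidia"' "$hnr_file"
}

# Usage: has_nvidia_runtime || echo "run: sudo nvidia-ctk runtime configure --runtime=docker"
```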
For running AI workloads at scale on Kubernetes, Kubo On-Premise provides a fully managed environment with automatic NVIDIA Device Plugin configuration, GPU resource quotas, and multi-tenant GPU sharing.
Troubleshooting
Common Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| IOMMU groups not isolated | Platform lacks ACS support | Add pcie_acs_override=downstream,multifunction to the kernel command line (weakens isolation between devices) |
| Out of memory on VM start | All VM RAM is pinned for device DMA | Reduce VM RAM, configure hugepages |
| NVIDIA Code 43 error (Windows guests) | Older drivers refuse to run in a detected VM | Verify cpu: host, use driver R465 or later |
| GPU reset issues | Some AMD GPUs cannot reset cleanly | Use the vendor-reset kernel module |
| Audio device conflicts | Host snd_hda_intel claims the GPU's audio function | Bind it to vfio-pci or blacklist snd_hda_intel |
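For the hugepages remedy, the host must reserve enough pages to back the whole VM. A sketch of the arithmetic (hypothetical helper): VM memory divided by page size, rounded up. The Proxmox VM option `hugepages` takes the page size in MiB (`2` or `1024`), and the host reservation is typically made with kernel parameters such as `hugepagesz=1G hugepages=N`.

```bash
# Hypothetical helper: pages needed to back a VM, rounded up.
# $1: VM memory in MiB; $2: hugepage size in MiB (2 or 1024)
hugepages_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Usage (the 32 GiB VM from this guide):
#   hugepages_needed 32768 1024   # 1 GiB pages
#   hugepages_needed 32768 2      # 2 MiB pages
```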
```bash
# Check IOMMU groups
for d in /sys/kernel/iommu_groups/*/devices/*; do
n=${d#*/iommu_groups/}; n=${n%%/*}
printf 'IOMMU Group %s ' "$n"
lspci -nns "${d##*/}"
done
```
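To focus on the GPU alone, sysfs exposes each device's group at `/sys/bus/pci/devices/<address>/iommu_group/devices`. A sketch with hypothetical helpers that lists the GPU's group peers and flags those outside the GPU's own slot; the address must include the PCI domain (e.g. `0000:01:00.0`). PCIe root ports or bridges reported here may be harmless, so treat the output as a starting point.

```bash
# Hypothetical helper: list devices sharing an IOMMU group with $1.
# $2: optional sysfs root (default /sys), useful for testing.
iommu_group_peers() {
  ls "${2:-/sys}/bus/pci/devices/$1/iommu_group/devices"
}

# Print peers that are not functions of the same slot (e.g. not 0000:01:00.x).
foreign_peers() {
  fp_slot="${1%.*}"
  for fp_dev in $(iommu_group_peers "$@"); do
    [ "${fp_dev%.*}" = "$fp_slot" ] || echo "$fp_dev"
  done
}

# Usage: foreign_peers 0000:01:00.0   # no output means the group holds only the GPU
```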
The Proxmox Wiki PCI Passthrough guide contains detailed troubleshooting procedures.
Conclusion
Proxmox VE GPU passthrough delivers near-bare-metal GPU performance for AI/ML workloads when IOMMU and VFIO are properly configured. On current hardware and kernels the setup comes down to a few well-defined steps: enable the IOMMU, bind the GPU to vfio-pci, and attach it to a q35/OVMF VM.
For running GPU-accelerated AI workloads at production scale on Kubernetes, Kubo On-Premise is the optimal platform. Combine Proxmox GPU passthrough with Kubo's fully managed K8s to build scalable AI infrastructure.
For consultation on GPU passthrough environment design, contact us to discuss your requirements.
Related Links: