Monday, September 21, 2020

(Failing to) reconfigure host PCI or USB devices for guest kvm libvirt passthrough at runtime: how to unbind built-in kernel drivers from devices without a reboot

I recently had an NVMe drive fault and wanted to check for a firmware update. Western Digital (formerly SanDisk) in their infinite wisdom only offer firmware for their drives in the form of ... a Windows management application.

How hard can it really be to run that under Linux?

Moderately, it turns out, and that's by using a Windows VM under libvirt-managed kvm to do it.

First I needed VT-d and the host IOMMU enabled:

  • Ensure VT-d was enabled in UEFI firmware (it was). You can check for CPU virtualisation support with grep -oE 'svm|vmx' /proc/cpuinfo | uniq; if there's output, it's available.
  • Run sudo virt-host-validate to check kvm. The line "Checking if IOMMU is enabled by kernel" was marked with a warning.
  • Add intel_iommu=on to my kernel command line and reboot (roughly as sketched just after this list)
  • Re-check virt-host-validate, which no longer complains
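
For reference, the kernel command line change looks roughly like this on a GRUB-based setup (a sketch; the config file location and the grub-mkconfig/update-grub step vary by distro):

$ sudoedit /etc/default/grub      # append intel_iommu=on to GRUB_CMDLINE_LINUX
$ sudo update-grub                # or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
$ sudo reboot
$ sudo virt-host-validate qemu | grep -i iommu   # the IOMMU check should now PASS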

Then I needed to unbind my NVMe controller and the parent bus from the host so I could pass it through. 

Obviously you can only do this if you're booted off something that won't need the PCI devices you're unbinding. In my case I'm booted from a USB3 HDD.
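
A quick sanity check for that (a sketch; adjust for your own mount layout):

$ findmnt -n -o SOURCE /          # which block device the root filesystem lives on
$ lsblk -o NAME,TRAN,MOUNTPOINT   # TRAN shows the transport: usb, sata, nvme, ...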

Identify the bus path to the NVMe device:

$ sudo lspci -t -nnn -vv -k 
-[0000:00]-+-00.0 Intel Corporation Device [8086:9b61]
           +-02.0 Intel Corporation UHD Graphics [8086:9b41]
           +-04.0 Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903]
           +-08.0 Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
           +-12.0 Intel Corporation Comet Lake Thermal Subsytem [8086:02f9]
           +-14.0 Intel Corporation Device [8086:02ed]
           +-14.2 Intel Corporation Device [8086:02ef]
           +-14.3 Intel Corporation Wireless-AC 9462 [8086:02f0]
           +-16.0 Intel Corporation Comet Lake Management Engine Interface [8086:02e0]
           +-17.0 Intel Corporation Comet Lake SATA AHCI Controller [8086:02d3]
           +-1d.0-[04]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168]
           +-1d.4-[07]----00.0 Sandisk Corp Device [15b7:5005]
           +-1f.0 Intel Corporation Device [8086:0284]
           +-1f.3 Intel Corporation Device [8086:02c8]
           +-1f.4 Intel Corporation Device [8086:02a3]
           \-1f.5 Intel Corporation Comet Lake SPI (flash) Controller [8086:02a4]

In this case it's device 15b7:5005.

I originally tried to pass through just that device after unloading the nvme module, but it failed with errors like

vfio-pci 0000:03:00.0: not ready 65535ms after FLR; giving up

so I ended up having to unbind the parent on the bus too.
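
If it isn't obvious from the lspci tree which port a device actually hangs off, sysfs spells out the topology (a sketch; substitute your device's address):

$ readlink -f /sys/bus/pci/devices/0000:07:00.0
# the second-to-last component of the printed path is the upstream bridge/port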

First, identify the kernel drivers used, bus IDs, and device IDs. In my case that's the NVMe SSD device 15b7:5005, the parent SATA AHCI controller 8086:02d3, and the PCIe Ethernet controller 10ec:8168, which seems to be a child of the SATA controller.

Use lspci -k -d {{deviceid}} to see the kernel driver bound to each device, and any kernel module(s) registered as supporting it, e.g.:

# lspci -k -d 8086:02d3
00:17.0 SATA controller: Intel Corporation Comet Lake SATA AHCI Controller
    Subsystem: Lenovo Device 5079
    Kernel driver in use: ahci

# lspci -k -d 15b7:5005
07:00.0 Non-Volatile memory controller: Sandisk Corp Device 5005 (rev 01)
    Subsystem: Sandisk Corp Device 5005
    Kernel driver in use: nvme
    Kernel modules: nvme

If it's owned by a module you can often just unload the module to unbind it, e.g.

$ rmmod nvme

but if it's owned by a built-in driver, as "ahci" is in my kernel, you can't do that: it doesn't show up in lsmod and can't be rmmod'd.
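
One way to confirm that a driver is built in rather than modular (a sketch; the kernel config path is distro-dependent):

$ grep -w ahci /lib/modules/$(uname -r)/modules.builtin
$ grep CONFIG_SATA_AHCI= /boot/config-$(uname -r)   # =y means built in, =m means module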

Instead you need to use sysfs to unbind it. (You can do this for devices bound to modules too, which is handy if you need the module for something critical for the host OS).
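
The same unbind hook is also exposed under the device's own sysfs node, which works identically whether the owning driver is modular or built in (a sketch; substitute your device's PCI address):

$ readlink /sys/bus/pci/devices/0000:00:17.0/driver   # which driver owns it right now
# echo '0000:00:17.0' > /sys/bus/pci/devices/0000:00:17.0/driver/unbind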

To unbind the ahci driver from the controller on my host, for example, here's what I did:

# ls /sys/module/ahci/drivers/
pci:ahci
# ls "/sys/module/ahci/drivers/pci:ahci/"
0000:00:17.0  bind  module  new_id  remove_id  uevent  unbind


Note how '0000:00:17.0' matches the bus address we saw in lspci? Cool. Now unbind it by writing that address to the driver's unbind file:

# echo '0000:00:17.0' > "/sys/module/ahci/drivers/pci:ahci/unbind"


Verify everything's unbound now:

# lspci -k -d 8086:02d3 
00:17.0 SATA controller: Intel Corporation Comet Lake SATA AHCI Controller
    Subsystem: Lenovo Device 5079
# lspci -k -d 15b7:5005
07:00.0 Non-Volatile memory controller: Sandisk Corp Device 5005 (rev 01)
    Subsystem: Sandisk Corp Device 5005
    Kernel modules: nvme

Now bind them both to the vfio-pci driver with:

# modprobe vfio-pci ids=8086:02d3,15b7:5005
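
If vfio-pci is already loaded, the ids= option won't take effect; in that case you can hand it the IDs through sysfs instead, and either way it's worth confirming the devices are now claimed (a sketch):

# echo '15b7 5005' > /sys/bus/pci/drivers/vfio-pci/new_id
# echo '8086 02d3' > /sys/bus/pci/drivers/vfio-pci/new_id
$ lspci -k -d 15b7:5005 | grep 'in use'   # expect: Kernel driver in use: vfio-pci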

Now, with a bit of luck, it can be attached to a kvm guest so it's accessible inside the VM.

I used virt-manager for that, because libvirt's semi-documented XML-based interface makes me want to scream. Just open the guest, "Add Hardware", "PCI Device", and pick both the NVMe controller and the parent device. I didn't bother with the Ethernet controller; it didn't seem to be needed.
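
If you do want to brave the XML, the equivalent libvirt hostdev attachment looks roughly like this (a sketch; "winvm" is a placeholder domain name, and the bus/slot values come from the lspci output above):

$ cat > nvme-hostdev.xml <<'EOF'
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
  </source>
</hostdev>
EOF
$ virsh attach-device winvm nvme-hostdev.xml --config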

Sadly, it still won't work:

[ 2641.079391] vfio-pci 0000:07:00.0: vfio_ecap_init: hiding ecap 0x19@0x300
[ 2641.079395] vfio-pci 0000:07:00.0: vfio_ecap_init: hiding ecap 0x1e@0x900
[ 2641.109860] pcieport 0000:00:1d.4: DPC: containment event, status:0x1f11 source:0x0000
[ 2641.109863] pcieport 0000:00:1d.4: DPC: unmasked uncorrectable error detected
[ 2641.109867] pcieport 0000:00:1d.4: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 2641.109868] pcieport 0000:00:1d.4: AER:   device [8086:02b4] error status/mask=00200000/00010000
[ 2641.109869] pcieport 0000:00:1d.4: AER:    [21] ACSViol                (First)
[ 2642.319541] vfio-pci 0000:07:00.0: not ready 1023ms after FLR; waiting
[ 2643.407529] vfio-pci 0000:07:00.0: not ready 2047ms after FLR; waiting
[ 2645.519286] vfio-pci 0000:07:00.0: not ready 4095ms after FLR; waiting
[ 2650.063213] vfio-pci 0000:07:00.0: not ready 8191ms after FLR; waiting
[ 2658.767373] vfio-pci 0000:07:00.0: not ready 16383ms after FLR; waiting
[ 2675.663235] vfio-pci 0000:07:00.0: not ready 32767ms after FLR; waiting
[ 2712.526507] vfio-pci 0000:07:00.0: not ready 65535ms after FLR; giving up
[ 2713.766494] pcieport 0000:00:1d.4: AER: device recovery successful
[ 2714.435994] vfio-pci 0000:07:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 2764.090284] pcieport 0000:00:1d.4: DPC: containment event, status:0x1f15 source:0x0700
[ 2764.090286] pcieport 0000:00:1d.4: DPC: ERR_FATAL detected
[ 2764.254057] pcieport 0000:00:1d.4: AER: device recovery successful

... followed by a hang of the VM. But hey. It was worth a try.

DPC is "Downstream Port Containment" which is supposed to protect the host from failures in PCI devices by isolating them.

This later scrambled my host graphics and I had to force a reboot anyway.

Still, at least now you know how to unbind a driver from a device on the host without a reboot and without messing around with module blacklisting etc.
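
For completeness, handing a device back to its original driver is the same dance in reverse (a sketch, using the SATA controller from above):

# echo '8086 02d3' > /sys/bus/pci/drivers/vfio-pci/remove_id
# echo '0000:00:17.0' > /sys/bus/pci/drivers/vfio-pci/unbind
# echo '0000:00:17.0' > /sys/bus/pci/drivers/ahci/bind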

yay?