System Config#

xNVMe relies on certain Operating System Kernel features and infrastructure that must be available and correctly configured. This subsection goes through what is used on Linux and how to check whether it is available.

Backends#

The purpose of the xNVMe backends is to provide an instrumental runtime supporting the xNVMe API in a single library with batteries included.

That is, it comes with the essential third-party libraries bundled into the xNVMe library. Thus, you get a single C API to program against and a single library to link with. The same goes for the command-line tools: a single binary communicating with devices via the I/O stacks that are available on the system.

To inspect the libraries which xNVMe is built against and the supported/enabled backends, invoke:

xnvme library-info

It should produce output similar to:

# xNVMe Library Information
ver: {major: 0, minor: 4, patch: 0}
xnvme_libconf:
  - '3p: fio;git-describe:fio-3.30'
  - '3p: liburing;git-describe:liburing-2.1-460-g4633a2d0'
  - '3p: spdk;git-describe:v21.10;+patches'
  - 'conf: XNVME_BE_LINUX_ENABLED'
  - 'conf: XNVME_BE_LINUX_BLOCK_ENABLED'
  - 'conf: XNVME_BE_LINUX_BLOCK_ZONED_ENABLED'
  - 'conf: XNVME_BE_LINUX_LIBAIO_ENABLED'
  - 'conf: XNVME_BE_LINUX_LIBURING_ENABLED'
  - 'conf: XNVME_BE_POSIX_ENABLED'
  - 'conf: XNVME_BE_SPDK_ENABLED'
  - 'conf: XNVME_BE_SPDK_TRANSPORT_PCIE_ENABLED'
  - 'conf: XNVME_BE_SPDK_TRANSPORT_TCP_ENABLED'
  - 'conf: XNVME_BE_SPDK_TRANSPORT_RDMA_ENABLED'
  - 'conf: XNVME_BE_SPDK_TRANSPORT_FC_ENABLED'
  - 'conf: XNVME_BE_ASYNC_ENABLED'
  - 'conf: XNVME_BE_ASYNC_EMU_ENABLED'
  - 'conf: XNVME_BE_ASYNC_THRPOOL_ENABLED'
  - '3p: linux;LINUX_VERSION_CODE-UAPI/330332-5.10.92'
  - '3p: NVME_IOCTL_IO64_CMD'
  - '3p: NVME_IOCTL_ADMIN64_CMD'
xnvme_be_attr_list:
  count: 5
  capacity: 5
  items:
  - name: 'spdk'
    enabled: 1

  - name: 'linux'
    enabled: 1

  - name: 'fbsd'
    enabled: 0

  - name: 'posix'
    enabled: 1

  - name: 'windows'
    enabled: 0

The 3p entries in the output inform about the third-party projects which xNVMe was built against and, in the case of libraries, the version it has bundled.

Although xNVMe provides a single API and a single library, runtime and system-configuration dependencies remain. The following subsections describe how to instrument xNVMe to utilize the different kernel interfaces and user space drivers.

Kernel#

Linux Kernel version 5.9 or newer is currently preferred as it has all the features which xNVMe utilizes. This section also gives you a brief overview of the different I/O paths and APIs which the xNVMe API unifies access to.

NVMe Driver and IOCTLs#

The default for xNVMe is to communicate with devices via the operating system NVMe driver IOCTLs; specifically, on Linux the following are used:

  • NVME_IOCTL_ID

  • NVME_IOCTL_IO_CMD

  • NVME_IOCTL_ADMIN_CMD

  • NVME_IOCTL_IO64_CMD

  • NVME_IOCTL_ADMIN64_CMD

In case the *64_CMD IOCTLs are not available, xNVMe falls back to using the non-64bit equivalents. The 64-bit vs. 32-bit completion result mostly affects commands such as Zone Append. You can check that this interface is behaving as expected by running:

xnvme info /dev/nvme0n1

Which should yield output equivalent to:

xnvme_dev:
  xnvme_ident:
    uri: '/dev/nvme0n1'
    dtype: 0x2
    nsid: 0x1
    csi: 0x0
  xnvme_be:
    admin: {id: 'nvme'}
    sync: {id: 'nvme'}
    async: {id: 'emu'}

This tells you that xNVMe can communicate with the given device identifier, and it informs you that it utilizes the nvme (IOCTL-based) interface for synchronous command execution and emu for asynchronous command execution. Since IOCTLs are inherently synchronous, xNVMe mimics asynchronous behavior over IOCTLs to support the asynchronous primitives provided by the xNVMe API.
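
For reference, the same inspection can be done from the C API. The following is a minimal sketch, assuming the xnvme_opts_default() / two-argument xnvme_dev_open(uri, &opts) form of recent xNVMe releases; consult the API reference of your installed version for the exact signatures:

#include <stdio.h>
#include <libxnvme.h>

int
main(void)
{
    // Default options; backend, sync and async interfaces are auto-selected
    struct xnvme_opts opts = xnvme_opts_default();
    struct xnvme_dev *dev;

    dev = xnvme_dev_open("/dev/nvme0n1", &opts);
    if (!dev) {
        fprintf(stderr, "xnvme_dev_open(): failed\n");
        return 1;
    }

    // Prints the ident, backend and geometry sections shown above
    xnvme_dev_pr(dev, XNVME_PR_DEF);
    xnvme_dev_close(dev);

    return 0;
}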

Block Layer#

In case your device is not an NVMe device, then the NVMe IOCTLs won’t be available. xNVMe will then try to utilize the Linux Block Layer and treat a given block device as an NVMe device via a shim layer for NVMe admin commands such as identify and get-features.

A brief example of checking this:

# Create a NULL Block instance
modprobe null_blk nr_devices=1
# Open and query the NULL Block instance with xNVMe
xnvme info /dev/nullb0
# Remove the NULL Block instance
modprobe -r null_blk

Yielding:

xnvme_dev:
  xnvme_ident:
    uri: '/dev/nullb0'
    dtype: 0x3
    nsid: 0x1
    csi: 0x1f
  xnvme_be:
    admin: {id: 'block'}
    sync: {id: 'block'}
    async: {id: 'emu'}
    attr: {name: 'linux'}
  xnvme_opts:
    be: 'linux'
    mem: 'FIX-ID-VS-MIXIN-NAME'
    dev: 'FIX-ID-VS-MIXIN-NAME'
    admin: 'block'
    sync: 'block'
    async: 'emu'
    oflags: 0x4
  xnvme_geo:
    type: XNVME_GEO_CONVENTIONAL
    npugrp: 1
    npunit: 1
    nzone: 1
    nsect: 524288000
    nbytes: 512
    nbytes_oob: 0
    tbytes: 268435456000
    mdts_nbytes: 65024
    lba_nbytes: 512
    lba_extended: 0
    ssw: 9

Block Zoned IOCTLs#

Building on the Linux block model, the Zoned Block Device model is also utilized, specifically via the following IOCTLs:

  • BLK_ZONE_REP_CAPACITY

  • BLKCLOSEZONE

  • BLKFINISHZONE

  • BLKOPENZONE

  • BLKRESETZONE

  • BLKGETNRZONES

  • BLKREPORTZONE

When available, xNVMe can make use of the above IOCTLs. This is mostly useful when developing/testing using Linux Null Block devices. Similar to the example above, for a Zoned NULL Block instance:

# Create a Zoned NULL Block instance
modprobe null_blk nr_devices=1 zoned=1
# Open and query the Zoned NULL Block instance with xNVMe
xnvme info /dev/nullb0
# Remove the Zoned NULL Block instance
modprobe -r null_blk

Yielding:

xnvme_dev:
  xnvme_ident:
    uri: '/dev/nullb0'
    dtype: 0x3
    nsid: 0x1
    csi: 0x2
  xnvme_be:
    admin: {id: 'block'}
    sync: {id: 'block'}
    async: {id: 'emu'}
    attr: {name: 'linux'}
  xnvme_opts:
    be: 'linux'
    mem: 'FIX-ID-VS-MIXIN-NAME'
    dev: 'FIX-ID-VS-MIXIN-NAME'
    admin: 'block'
    sync: 'block'
    async: 'emu'
    oflags: 0x4
  xnvme_geo:
    type: XNVME_GEO_ZONED
    npugrp: 1
    npunit: 1
    nzone: 1000
    nsect: 524288
    nbytes: 512
    nbytes_oob: 0
    tbytes: 268435456000
    mdts_nbytes: 65024
    lba_nbytes: 512
    lba_extended: 0
    ssw: 9

Async I/O via libaio#

When Linux AIO is available, the NVMe NVM commands for read and write are sent over the Linux AIO interface. Doing so improves command throughput at higher queue depths when compared to sending the commands via the NVMe driver ioctl().

One can explicitly tell xNVMe to utilize libaio for async I/O by using the --async option, like so:

xnvme_io_async read /dev/nvme0n1 --slba 0x0 --qdepth 1 --async libaio

Yielding the output:

# Allocating and filling buf of nbytes: 4096
# Initializing queue and setting default callback function and arguments
# Read uri: '/dev/nvme0n1', qd: 1
xnvme_lba_range:
  slba: 0x0000000000000000
  elba: 0x0000000000000000
  naddrs: 1
  nbytes: 4096
  attr: { is_zones: 0, is_valid: 1}
wall-clock: {elapsed: 0.0000, mib: 0.00, mib_sec: 265.77}
# cb_args: {submitted: 1, completed: 1, ecount: 0}
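
The same selection can be made from the C API by naming the asynchronous interface in the device options. The following is a hedged sketch, assuming the xnvme_opts-based open of recent xNVMe releases; the same field selects the other asynchronous interfaces described in this section:

#include <libxnvme.h>

// Open a device with libaio explicitly selected for asynchronous I/O;
// swapping the string for e.g. "io_uring" or "emu" selects those interfaces instead.
struct xnvme_dev *
open_with_libaio(const char *uri)
{
    struct xnvme_opts opts = xnvme_opts_default();

    opts.async = "libaio";

    return xnvme_dev_open(uri, &opts);
}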

Async I/O via io_uring#

xNVMe utilizes the Linux io_uring interface and its support for feature-probing to detect availability of the following io_uring opcodes:

  • IORING_OP_READ

  • IORING_OP_WRITE

When available, xNVMe can send the NVMe NVM commands for read and write via the Linux io_uring interface. Doing so improves command throughput at all I/O depths when compared to sending the commands via NVMe driver IOCTLs and libaio. It also leverages the io_uring interface to enable I/O polling and kernel-side submission polling.

One can explicitly tell xNVMe to utilize io_uring for async I/O by using the --async option, like so:

xnvme_io_async read /dev/nvme0n1 --slba 0x0 --qdepth 1 --async io_uring

Yielding the output:

# Allocating and filling buf of nbytes: 4096
# Initializing queue and setting default callback function and arguments
# Read uri: '/dev/nvme0n1', qd: 1
xnvme_lba_range:
  slba: 0x0000000000000000
  elba: 0x0000000000000000
  naddrs: 1
  nbytes: 4096
  attr: { is_zones: 0, is_valid: 1}
wall-clock: {elapsed: 0.0000, mib: 0.00, mib_sec: 325.17}
# cb_args: {submitted: 1, completed: 1, ecount: 0}

User Space#

Linux provides the Userspace I/O (uio) and Virtual Function I/O (vfio) frameworks for writing user space I/O drivers. Both interfaces work by binding a given device to an in-kernel stub driver. The stub driver in turn exposes device memory and device interrupts to user space, thus enabling the implementation of device drivers entirely in user space.

Although Linux provides a capable NVMe driver with flexible IOCTLs, a user space NVMe driver serves those who seek the lowest possible per-command processing overhead or want full control over NVMe command construction, including command payloads.

Fortunately, you do not need to go and write a user space NVMe driver, since a highly efficient, mature, and well-maintained driver already exists: namely, the NVMe driver provided by the Storage Performance Development Kit (SPDK).

Another great fortune is that xNVMe bundles the SPDK NVMe Driver with the xNVMe library. So, if you have built and installed xNVMe then the SPDK NVMe Driver is readily available to xNVMe.

The following subsections go through a configuration checklist, then show how to bind and unbind drivers, and lastly how to utilize non-devfs device identifiers by enumerating the system and inspecting a device.

Config#

What remains is checking your system configuration, enabling IOMMU for use by the vfio-pci driver, and possibly falling back to the uio_pci_generic driver in case vfio-pci is not working out. vfio is preferred as hardware support for IOMMU allows for isolation between devices.

  1. Verify that your CPU supports virtualization / VT-d and that it is enabled in your board BIOS.

  2. Enable IOMMU in your kernel; for an Intel CPU, provide the kernel option intel_iommu=on. If you have a non-Intel CPU, consult the documentation on enabling VT-d / IOMMU for your CPU.

  3. Increase the memlock limits: open /etc/security/limits.conf and add:

*    soft memlock unlimited
*    hard memlock unlimited
root soft memlock unlimited
root hard memlock unlimited

Once you have gone through these steps and rebooted, this command:

dmesg | grep "DMAR: IOMMU"

Should output:

[    0.023467] DMAR: IOMMU enabled

And this command:

find /sys/kernel/iommu_groups/ -type l

Should have output similar to:

/sys/kernel/iommu_groups/7/devices/0000:01:00.0
/sys/kernel/iommu_groups/5/devices/0000:00:05.0
/sys/kernel/iommu_groups/3/devices/0000:00:03.0
/sys/kernel/iommu_groups/1/devices/0000:00:01.0
/sys/kernel/iommu_groups/8/devices/0000:03:00.0
/sys/kernel/iommu_groups/8/devices/0000:02:00.0
/sys/kernel/iommu_groups/6/devices/0000:00:1f.2
/sys/kernel/iommu_groups/6/devices/0000:00:1f.0
/sys/kernel/iommu_groups/6/devices/0000:00:1f.3
/sys/kernel/iommu_groups/4/devices/0000:00:04.0
/sys/kernel/iommu_groups/2/devices/0000:00:02.0
/sys/kernel/iommu_groups/0/devices/0000:00:00.0

Unbinding and binding#

With the system configured, you can use the xnvme-driver script to bind and unbind devices. The xnvme-driver script is a merge of the SPDK setup.sh script and its dependencies.

By running the command below, 4GB of hugepages will be configured, the kernel NVMe driver unbound, and vfio-pci bound to the device:

HUGEMEM=4096 xnvme-driver

The command above should produce output similar to:

0000:03:00.0 (1b36 0010): nvme -> vfio-pci
0000:00:02.0 (1af4 1001): Active mountpoints on /dev/vda, so not binding

To unbind from vfio-pci and bind back to the kernel NVMe driver, run:

xnvme-driver reset

This should produce output similar to:

0000:03:00.0 (1b36 0010): vfio-pci -> nvme
0000:00:02.0 (1af4 1001): Already using the virtio-pci driver

Device Identifiers#

Since the kernel NVMe driver is unbound from the device, the kernel no longer knows that the PCIe device is an NVMe device. Thus, it no longer lives in Linux devfs, that is, it is no longer available in /dev as e.g. /dev/nvme0n1.

Instead of the filepath in devfs, you use PCI ids and xNVMe options.

As always, use the xnvme cli tool to enumerate devices:

xnvme enum
xnvme_enumeration:
  - {uri: '0000:03:00.0', dtype: 0x2, nsid: 0x1, csi: 0x0}
  - {uri: '0000:03:00.0', dtype: 0x2, nsid: 0x2, csi: 0x2}

Notice that multiple URIs use the same PCI id but differ in their xNVMe options, here the namespace identifier (nsid). These options are the means to tell xNVMe that you want to use the NVMe controller at 0000:03:00.0 and the namespace identified by nsid=1.

xnvme-driver
xnvme info 0000:03:00.0 --dev-nsid=1
0000:03:00.0 (1b36 0010): Already using the vfio-pci driver
0000:00:02.0 (1af4 1001): Active mountpoints on /dev/vda, so not binding


xnvme_dev:
  xnvme_ident:
    uri: '0000:03:00.0'
    dtype: 0x2
    nsid: 0x1
    csi: 0x0
  xnvme_be:
    admin: {id: 'nvme'}
    sync: {id: 'nvme'}
    async: {id: 'nvme'}
    attr: {name: 'spdk'}
  xnvme_opts:
    be: 'spdk'
    mem: 'FIX-ID-VS-MIXIN-NAME'
    dev: 'FIX-ID-VS-MIXIN-NAME'
    admin: 'nvme'
    sync: 'nvme'
    async: 'nvme'
    oflags: 0x4
  xnvme_geo:
    type: XNVME_GEO_CONVENTIONAL
    npugrp: 1
    npunit: 1
    nzone: 1
    nsect: 2097152
    nbytes: 4096
    nbytes_oob: 0
    tbytes: 8589934592
    mdts_nbytes: 524288
    lba_nbytes: 4096
    lba_extended: 0
    ssw: 12

Similarly, when using the API, you would use these URIs instead of filepaths:

...
struct xnvme_dev *dev = xnvme_dev_open("pci:0000:01:00.0?nsid=1");
...
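
Depending on the xNVMe version, the namespace can be encoded in the URI as above or passed via the option-structure. The following is a hedged sketch of the latter, assuming the xnvme_opts-based open of recent releases and mirroring the xnvme info 0000:03:00.0 --dev-nsid=1 invocation shown earlier:

#include <stdio.h>
#include <libxnvme.h>

int
main(void)
{
    struct xnvme_opts opts = xnvme_opts_default();
    struct xnvme_dev *dev;

    opts.nsid = 1;      // counterpart of the --dev-nsid=1 CLI option (assumed field name)
    opts.be = "spdk";   // optional: insist on the user space (SPDK) backend

    dev = xnvme_dev_open("0000:03:00.0", &opts);
    if (!dev) {
        fprintf(stderr, "xnvme_dev_open(): failed\n");
        return 1;
    }

    xnvme_dev_pr(dev, XNVME_PR_DEF);
    xnvme_dev_close(dev);

    return 0;
}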

Windows Kernel#

Windows 10 or later is currently preferred as it has all the features which xNVMe utilizes. This section also gives you a brief overview of the different I/O paths and APIs which the xNVMe API unifies access to.

NVMe Driver and IOCTLs#

The default for xNVMe is to communicate with devices via the operating system NVMe driver IOCTLs; specifically, on Windows the following are used:

  • IOCTL_STORAGE_QUERY_PROPERTY

  • IOCTL_STORAGE_SET_PROPERTY

  • IOCTL_STORAGE_REINITIALIZE_MEDIA

  • IOCTL_SCSI_PASS_THROUGH_DIRECT

You can check that this interface is behaving as expected by running:

xnvme.exe info \\.\PhysicalDrive0

Which should yield output equivalent to:

xnvme_dev:
  xnvme_ident:
    uri: '\\.\PhysicalDrive0'
    dtype: 0x2
    nsid: 0x1
    csi: 0x0
    subnqn: 'nqn.1994-11.com.samsung:nvme:980M.2:S649NL0T973010L     '
  xnvme_be:
    admin: {id: 'nvme'}
    sync: {id: 'nvme'}
    async: {id: 'iocp'}
    attr: {name: 'windows'}

This tells you that xNVMe can communicate with the given device identifier, and it informs you that it utilizes the nvme (IOCTL-based) interface for synchronous command execution and iocp for asynchronous command execution. This method can be used for raw devices via the \\.\PhysicalDrive<disk number> device path.

The following commands are currently supported by xNVMe via the IOCTL path (see the sketch after the list):

  • Admin Commands
    • Get Log Page

    • Identify

    • Get Feature

    • Format NVM

  • I/O Commands
    • Read

    • Write
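
As a usage sketch of the supported I/O path, the snippet below issues a single synchronous read through the xNVMe API. It is a hedged example: it assumes the xnvme_opts/xnvme_dev_open() and xnvme_nvm_read() helpers of recent xNVMe releases (on some versions the latter lives in libxnvme_nvm.h), so consult the API reference of your installed version for the exact headers and signatures:

#include <stdio.h>
#include <libxnvme.h>

int
main(void)
{
    struct xnvme_opts opts = xnvme_opts_default();
    struct xnvme_dev *dev;
    struct xnvme_cmd_ctx ctx;
    void *buf;
    int err;

    dev = xnvme_dev_open("\\\\.\\PhysicalDrive0", &opts);
    if (!dev) {
        fprintf(stderr, "xnvme_dev_open(): failed\n");
        return 1;
    }

    // Device-aligned buffer for one logical block (512 bytes on the drive above)
    buf = xnvme_buf_alloc(dev, 512);

    // Synchronous read of LBA 0 on the active namespace; nlb is zero-based
    ctx = xnvme_cmd_ctx_from_dev(dev);
    err = xnvme_nvm_read(&ctx, xnvme_dev_get_nsid(dev), 0x0, 0, buf, NULL);
    if (err || xnvme_cmd_ctx_cpl_status(&ctx)) {
        fprintf(stderr, "xnvme_nvm_read(): err: %d\n", err);
    }

    xnvme_buf_free(dev, buf);
    xnvme_dev_close(dev);

    return 0;
}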

NVMe Driver and Regular File#

xNVMe can communicate with file-system-mounted devices via generic operating system APIs such as ReadFile and WriteFile. This method can be used to perform operations on regular files.

You can check that this interface is behaving as expected by running:

xnvme.exe info C:\README.md

Which should yield output equivalent to:

xnvme_dev:
  xnvme_ident:
    uri: 'C:\README.md'
    dtype: 0x4
    nsid: 0x1
    csi: 0x1f
    subnqn: ''
  xnvme_be:
    admin: {id: 'file'}
    sync: {id: 'file'}
    async: {id: 'iocp'}
    attr: {name: 'windows'}

This tells you that xNVMe can communicate with the given regular file, and it informs you that it utilizes the file interface for synchronous command execution and iocp for asynchronous command execution. This method can be used for file operations via a <drive letter>:<file path> path.

Async I/O via iocp#

When asynchronous I/O is available, the NVMe NVM commands for read and write are sent over the Windows IOCP interface. Doing so improves command throughput at higher queue depths when compared to sending the commands via the NVMe driver IOCTLs.

One can explicitly tell xNVMe to utilize iocp for async I/O by using the --async option, like so:

xnvme_io_async read \\.\PhysicalDrive0 --slba 0x0 --qdepth 1 --async iocp

Yielding the output:

# Allocating and filling buf of nbytes: 512
# Initializing queue and setting default callback function and arguments
# Read uri: '\\.\PhysicalDrive0', qd: 1
xnvme_lba_range:
  slba: 0x0000000000000000
  elba: 0x0000000000000000
  naddrs: 1
  nbytes: 512
  attr: { is_zones: 0, is_valid: 1}
wall-clock: {elapsed: 0.0002, mib: 0.00, mib_sec: 2.08}
# cb_args: {submitted: 1, completed: 1, ecount: 0}

Async I/O via iocp_th#

Similar to the iocp interface; the only difference is that a separate poller is used to fetch the completed I/Os.

One can explicitly tell xNVMe to utilize iocp_th for async I/O by using the --async option, like so:

xnvme_io_async read \\.\PhysicalDrive0 --slba 0x0 --qdepth 1 --async iocp_th

Yielding the output:

# Allocating and filling buf of nbytes: 512
# Initializing queue and setting default callback function and arguments
# Read uri: '\\.\PhysicalDrive0', qd: 1
xnvme_lba_range:
  slba: 0x0000000000000000
  elba: 0x0000000000000000
  naddrs: 1
  nbytes: 512
  attr: { is_zones: 0, is_valid: 1}
wall-clock: {elapsed: 0.0002, mib: 0.00, mib_sec: 2.14}
# cb_args: {submitted: 1, completed: 1, ecount: 0}

Async I/O via io_ring#

xNVMe utilizes the Windows io_ring interface and its support for feature-probing the io_ring capabilities.

When available, xNVMe can send io_ring-specific requests, using the IORING_HANDLE_REF and IORING_BUFFER_REF structures, for read and write via the Windows io_ring interface. Doing so improves command throughput at all I/O depths when compared to sending the commands via NVMe driver IOCTLs.

One can explicitly tell xNVMe to utilize io_ring for async I/O by using the --async option, like so:

xnvme_io_async read \\.\PhysicalDrive0 --slba 0x0 --qdepth 1 --async io_ring

Yielding the output:

# Allocating and filling buf of nbytes: 512
# Initializing queue and setting default callback function and arguments
# Read uri: '\\.\PhysicalDrive0', qd: 1
xnvme_lba_range:
  slba: 0x0000000000000000
  elba: 0x0000000000000000
  naddrs: 1
  nbytes: 512
  attr: { is_zones: 0, is_valid: 1}
wall-clock: {elapsed: 0.0003, mib: 0.00, mib_sec: 1.92}
# cb_args: {submitted: 1, completed: 1, ecount: 0}