System Config#
xNVMe relies on certain Operating System Kernel features and infrastructure that must be available and correctly configured. This subsection goes through what is used on Linux and how to check whether it is available.
Backends#
The purpose of the xNVMe backends is to provide an instrumental runtime supporting the xNVMe API in a single library with batteries included.
That is, it comes with the essential third-party libraries bundled into the xNVMe library. Thus, you get a single C API to program against and a single library to link with. Similarly for the command-line tools: a single binary communicating with devices via the I/O stacks available on the system.
To inspect the libraries which xNVMe is built against, and the supported/enabled backends, invoke:
xnvme library-info
It should produce output similar to:
# xNVMe Library Information
ver: {major: 0, minor: 4, patch: 0}
xnvme_libconf:
  - '3p: fio;git-describe:fio-3.30'
  - '3p: liburing;git-describe:liburing-2.1-460-g4633a2d0'
  - '3p: spdk;git-describe:v21.10;+patches'
  - 'conf: XNVME_BE_LINUX_ENABLED'
  - 'conf: XNVME_BE_LINUX_BLOCK_ENABLED'
  - 'conf: XNVME_BE_LINUX_BLOCK_ZONED_ENABLED'
  - 'conf: XNVME_BE_LINUX_LIBAIO_ENABLED'
  - 'conf: XNVME_BE_LINUX_LIBURING_ENABLED'
  - 'conf: XNVME_BE_POSIX_ENABLED'
  - 'conf: XNVME_BE_SPDK_ENABLED'
  - 'conf: XNVME_BE_SPDK_TRANSPORT_PCIE_ENABLED'
  - 'conf: XNVME_BE_SPDK_TRANSPORT_TCP_ENABLED'
  - 'conf: XNVME_BE_SPDK_TRANSPORT_RDMA_ENABLED'
  - 'conf: XNVME_BE_SPDK_TRANSPORT_FC_ENABLED'
  - 'conf: XNVME_BE_ASYNC_ENABLED'
  - 'conf: XNVME_BE_ASYNC_EMU_ENABLED'
  - 'conf: XNVME_BE_ASYNC_THRPOOL_ENABLED'
  - '3p: linux;LINUX_VERSION_CODE-UAPI/330332-5.10.92'
  - '3p: NVME_IOCTL_IO64_CMD'
  - '3p: NVME_IOCTL_ADMIN64_CMD'
xnvme_be_attr_list:
  count: 5
  capacity: 5
  items:
  - name: 'spdk'
    enabled: 1
  - name: 'linux'
    enabled: 1
  - name: 'fbsd'
    enabled: 0
  - name: 'posix'
    enabled: 1
  - name: 'windows'
    enabled: 0
The xnvme_3p part of the output lists the third-party projects which xNVMe was built against and, in the case of libraries, the version it has bundled.
Although xNVMe provides a single API and a single library, runtime and system configuration dependencies remain. The following subsections describe how to instrument xNVMe to utilize the different kernel interfaces and user space drivers.
Kernel#
Linux Kernel version 5.9 or newer is currently preferred as it has all the features which xNVMe utilizes. This section also gives you a brief overview of the different I/O paths and APIs which the xNVMe API unifies access to.
NVMe Driver and IOCTLs#
The default for xNVMe is to communicate with devices via the operating system NVMe driver IOCTLs, specifically on Linux the following are used:
NVME_IOCTL_ID
NVME_IOCTL_IO_CMD
NVME_IOCTL_ADMIN_CMD
NVME_IOCTL_IO64_CMD
NVME_IOCTL_ADMIN64_CMD
In case the *64_CMD IOCTLs are not available, xNVMe falls back to using the non-64bit equivalents. The 64-bit vs 32-bit completion result mostly affects commands such as Zone Append. You can check that this interface is behaving as expected by running:
xnvme info /dev/nvme0n1
Which should yield output equivalent to:
xnvme_dev:
  xnvme_ident:
    uri: '/dev/nvme0n1'
    dtype: 0x2
    nsid: 0x1
    csi: 0x0
  xnvme_be:
    admin: {id: 'nvme'}
    sync: {id: 'nvme'}
    async: {id: 'emu'}
This tells you that xNVMe can communicate with the given device identifier, that it utilizes the NVMe driver IOCTLs for synchronous command execution, and that it uses emu for asynchronous command execution. Since IOCTLs are inherently synchronous, xNVMe mimics asynchronous behavior over IOCTLs to support the asynchronous primitives provided by the xNVMe API.
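The same information can be retrieved programmatically. The sketch below is a minimal, hypothetical example: it assumes the xNVMe headers are installed, that /dev/nvme0n1 exists, and that helper names such as xnvme_dev_pr and XNVME_PR_DEF match your installed version of the library:

```c
#include <stdio.h>
#include <libxnvme.h>

int
main(void)
{
	// Hypothetical device path; adjust to a device on your system
	struct xnvme_dev *dev = xnvme_dev_open("/dev/nvme0n1");
	if (!dev) {
		perror("xnvme_dev_open");
		return 1;
	}

	// Prints ident, backend, and geometry, similar to 'xnvme info'
	xnvme_dev_pr(dev, XNVME_PR_DEF);

	xnvme_dev_close(dev);
	return 0;
}
```

Link against the xNVMe library when compiling; the exact flags depend on how xNVMe was installed on your system.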
Block Layer#
In case your device is not an NVMe device, the NVMe IOCTLs won’t be available. xNVMe will then try to utilize the Linux Block Layer and treat a given block device as an NVMe device via a shim layer for NVMe admin commands such as identify and get-features.
A brief example of checking this:
# Create a NULL Block instance
modprobe null_blk nr_devices=1
# Open and query the NULL Block instance with xNVMe
xnvme info /dev/nullb0
# Remove the NULL Block instance
modprobe -r null_blk
Yielding:
xnvme_dev:
  xnvme_ident:
    uri: '/dev/nullb0'
    dtype: 0x3
    nsid: 0x1
    csi: 0x1f
  xnvme_be:
    admin: {id: 'block'}
    sync: {id: 'block'}
    async: {id: 'emu'}
    attr: {name: 'linux'}
  xnvme_opts:
    be: 'linux'
    mem: 'FIX-ID-VS-MIXIN-NAME'
    dev: 'FIX-ID-VS-MIXIN-NAME'
    admin: 'block'
    sync: 'block'
    async: 'emu'
    oflags: 0x4
  xnvme_geo:
    type: XNVME_GEO_CONVENTIONAL
    npugrp: 1
    npunit: 1
    nzone: 1
    nsect: 524288000
    nbytes: 512
    nbytes_oob: 0
    tbytes: 268435456000
    mdts_nbytes: 65024
    lba_nbytes: 512
    lba_extended: 0
    ssw: 9
Block Zoned IOCTLs#
Building on the Linux Block model, the Zoned Block Device model is also utilized, specifically the following IOCTLs:
BLK_ZONE_REP_CAPACITY
BLKCLOSEZONE
BLKFINISHZONE
BLKOPENZONE
BLKRESETZONE
BLKGETNRZONES
BLKREPORTZONE
When available, xNVMe can make use of the above IOCTLs. This is mostly useful when developing/testing using Linux Null Block devices. Similarly, for a Zoned NULL Block instance:
# Create a Zoned NULL Block instance
modprobe null_blk nr_devices=1 zoned=1
# Open and query the Zoned NULL Block instance with xNVMe
xnvme info /dev/nullb0
# Remove the Zoned NULL Block instance
modprobe -r null_blk
Yielding:
xnvme_dev:
  xnvme_ident:
    uri: '/dev/nullb0'
    dtype: 0x3
    nsid: 0x1
    csi: 0x2
  xnvme_be:
    admin: {id: 'block'}
    sync: {id: 'block'}
    async: {id: 'emu'}
    attr: {name: 'linux'}
  xnvme_opts:
    be: 'linux'
    mem: 'FIX-ID-VS-MIXIN-NAME'
    dev: 'FIX-ID-VS-MIXIN-NAME'
    admin: 'block'
    sync: 'block'
    async: 'emu'
    oflags: 0x4
  xnvme_geo:
    type: XNVME_GEO_ZONED
    npugrp: 1
    npunit: 1
    nzone: 1000
    nsect: 524288
    nbytes: 512
    nbytes_oob: 0
    tbytes: 268435456000
    mdts_nbytes: 65024
    lba_nbytes: 512
    lba_extended: 0
    ssw: 9
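To illustrate working with a zoned device from the API, the hedged sketch below retrieves and prints a zone report for the null-block instance. It assumes the zoned helper API from libxnvme_znd.h; names such as xnvme_znd_report_from_dev and xnvme_znd_report_pr exist in xNVMe, but exact signatures may differ between versions, so consult your installed headers:

```c
#include <stdio.h>
#include <libxnvme.h>
#include <libxnvme_znd.h>

int
main(void)
{
	// Assumes the Zoned NULL Block instance created above
	struct xnvme_dev *dev = xnvme_dev_open("/dev/nullb0");
	if (!dev) {
		perror("xnvme_dev_open");
		return 1;
	}

	// Retrieve a zone report for the whole device (limit=0 => all zones)
	struct xnvme_znd_report *report = xnvme_znd_report_from_dev(dev, 0x0, 0, 0);
	if (report) {
		xnvme_znd_report_pr(report, XNVME_PR_DEF);
		xnvme_znd_report_free(report);
	}

	xnvme_dev_close(dev);
	return 0;
}
```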
Async I/O via libaio#
When Linux AIO is available, the NVMe NVM commands for read and write are sent over the Linux AIO interface. Doing so improves command throughput at higher queue depths when compared to sending the commands via the NVMe driver ioctl().
One can explicitly tell xNVMe to utilize libaio for async I/O by encoding it in the device identifier, like so:
xnvme_io_async read /dev/nvme0n1 --slba 0x0 --qdepth 1 --async libaio
Yielding the output:
# Allocating and filling buf of nbytes: 4096
# Initializing queue and setting default callback function and arguments
# Read uri: '/dev/nvme0n1', qd: 1
xnvme_lba_range:
  slba: 0x0000000000000000
  elba: 0x0000000000000000
  naddrs: 1
  nbytes: 4096
  attr: { is_zones: 0, is_valid: 1}
wall-clock: {elapsed: 0.0000, mib: 0.00, mib_sec: 265.77}
# cb_args: {submitted: 1, completed: 1, ecount: 0}
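What the xnvme_io_async tool does above can be sketched with the queue API. This is a hedged sketch, not a verbatim xNVMe example: names such as xnvme_queue_set_cb, xnvme_cmd_ctx_from_queue, and xnvme_nvm_read exist in the xNVMe API, but exact signatures (for instance of xnvme_buf_alloc and the command-context members) vary between versions, so verify against your installed headers:

```c
#include <stdio.h>
#include <libxnvme.h>
#include <libxnvme_nvm.h>

// Completion callback; invoked while reaping completions
static void
cb_fn(struct xnvme_cmd_ctx *ctx, void *cb_arg)
{
	int *completed = cb_arg;

	*completed += 1;
	if (xnvme_cmd_ctx_cpl_status(ctx)) {
		xnvme_cmd_ctx_pr(ctx, XNVME_PR_DEF); // print completion error
	}
	xnvme_queue_put_cmd_ctx(ctx->async.queue, ctx); // hand the ctx back
}

int
main(void)
{
	// Hypothetical device path; '?async=libaio' selects the libaio backend
	struct xnvme_dev *dev = xnvme_dev_open("/dev/nvme0n1?async=libaio");
	struct xnvme_queue *queue = NULL;
	int completed = 0;

	if (!dev || xnvme_queue_init(dev, 16, 0x0, &queue)) {
		perror("xnvme_dev_open/xnvme_queue_init");
		return 1;
	}
	xnvme_queue_set_cb(queue, cb_fn, &completed);

	void *buf = xnvme_buf_alloc(dev, 4096);
	struct xnvme_cmd_ctx *ctx = xnvme_cmd_ctx_from_queue(queue);

	// Submit a read of a single LBA (nlb is zero-based) at slba 0x0
	if (xnvme_nvm_read(ctx, xnvme_dev_get_nsid(dev), 0x0, 0, buf, NULL)) {
		perror("xnvme_nvm_read");
	}
	xnvme_queue_drain(queue); // reap outstanding completions

	printf("completed: %d\n", completed);

	xnvme_buf_free(dev, buf);
	xnvme_queue_term(queue);
	xnvme_dev_close(dev);
	return 0;
}
```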
Async I/O via io_uring#
xNVMe utilizes the Linux io_uring interface, including its support for feature-probing, and the following io_uring opcodes:
IORING_OP_READ
IORING_OP_WRITE
When available, xNVMe can send the NVMe NVM commands for read and write via the Linux io_uring interface. Doing so improves command throughput at all I/O depths when compared to sending the commands via NVMe driver IOCTLs or libaio. xNVMe also leverages the io_uring interface to enable I/O polling and kernel-side submission-queue polling.
One can explicitly tell xNVMe to utilize io_uring for async I/O by encoding it in the device identifier, like so:
xnvme_io_async read /dev/nvme0n1 --slba 0x0 --qdepth 1 --async io_uring
Yielding the output:
# Allocating and filling buf of nbytes: 4096
# Initializing queue and setting default callback function and arguments
# Read uri: '/dev/nvme0n1', qd: 1
xnvme_lba_range:
  slba: 0x0000000000000000
  elba: 0x0000000000000000
  naddrs: 1
  nbytes: 4096
  attr: { is_zones: 0, is_valid: 1}
wall-clock: {elapsed: 0.0000, mib: 0.00, mib_sec: 325.17}
# cb_args: {submitted: 1, completed: 1, ecount: 0}
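The polling features mentioned above are requested at queue initialization. The sketch below is hypothetical: it assumes the queue-option flags XNVME_QUEUE_IOPOLL and XNVME_QUEUE_SQPOLL from the xNVMe queue API, whose availability and exact names should be verified against your installed headers:

```c
#include <stdio.h>
#include <libxnvme.h>

int
main(void)
{
	// '?async=io_uring' selects the io_uring async implementation
	struct xnvme_dev *dev = xnvme_dev_open("/dev/nvme0n1?async=io_uring");
	struct xnvme_queue *queue = NULL;

	if (!dev) {
		perror("xnvme_dev_open");
		return 1;
	}

	// XNVME_QUEUE_IOPOLL requests completion polling (IORING_SETUP_IOPOLL);
	// XNVME_QUEUE_SQPOLL would request kernel-side submission polling
	if (xnvme_queue_init(dev, 16, XNVME_QUEUE_IOPOLL, &queue)) {
		perror("xnvme_queue_init");
		xnvme_dev_close(dev);
		return 1;
	}

	// ... submit I/O via xnvme_cmd_ctx_from_queue() as usual ...

	xnvme_queue_term(queue);
	xnvme_dev_close(dev);
	return 0;
}
```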
User Space#
Linux provides the Userspace I/O (uio) and Virtual Function I/O (vfio) frameworks for writing user space I/O drivers. Both interfaces work by binding a given device to an in-kernel stub driver. The stub driver in turn exposes device memory and device interrupts to user space, enabling the implementation of device drivers entirely in user space.
Although Linux provides a capable NVMe driver with flexible IOCTLs, a user space NVMe driver serves those who seek the lowest possible per-command processing overhead or want full control over NVMe command construction, including command payloads.
Fortunately, you do not need to write a user space NVMe driver, since a highly efficient, mature, and well-maintained driver already exists: the NVMe driver provided by the Storage Performance Development Kit (SPDK).
Another great fortune is that xNVMe bundles the SPDK NVMe driver with the xNVMe library. So, if you have built and installed xNVMe, the SPDK NVMe driver is readily available.
The following subsections go through a configuration checklist, then show how to bind and unbind drivers, and lastly how to utilize non-devfs device identifiers by enumerating the system and inspecting a device.
Config#
What remains is checking your system configuration, enabling IOMMU for use by the vfio-pci driver, and possibly falling back to the uio_pci_generic driver in case vfio-pci does not work out. vfio is preferred, as hardware support for IOMMU allows for isolation between devices.
Verify that your CPU supports virtualization / VT-d and that it is enabled in your board BIOS.
If you have an Intel CPU, then provide the kernel option intel_iommu=on. If you have a non-Intel CPU, then consult the documentation on enabling VT-d / IOMMU for your CPU.
Increase the memory-lock limits: open /etc/security/limits.conf and add:
* soft memlock unlimited
* hard memlock unlimited
root soft memlock unlimited
root hard memlock unlimited
Once you have gone through these steps and rebooted, this command:
dmesg | grep "DMAR: IOMMU"
Should output:
[ 0.023467] DMAR: IOMMU enabled
And this command:
find /sys/kernel/iommu_groups/ -type l
Should have output similar to:
/sys/kernel/iommu_groups/7/devices/0000:01:00.0
/sys/kernel/iommu_groups/5/devices/0000:00:05.0
/sys/kernel/iommu_groups/3/devices/0000:00:03.0
/sys/kernel/iommu_groups/1/devices/0000:00:01.0
/sys/kernel/iommu_groups/8/devices/0000:03:00.0
/sys/kernel/iommu_groups/8/devices/0000:02:00.0
/sys/kernel/iommu_groups/6/devices/0000:00:1f.2
/sys/kernel/iommu_groups/6/devices/0000:00:1f.0
/sys/kernel/iommu_groups/6/devices/0000:00:1f.3
/sys/kernel/iommu_groups/4/devices/0000:00:04.0
/sys/kernel/iommu_groups/2/devices/0000:00:02.0
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
Unbinding and binding#
With the system configured, you can use the xnvme-driver script to bind and unbind devices. The xnvme-driver script is a merge of the SPDK setup.sh script and its dependencies.
By running the command below, 4GB of hugepages will be configured, the kernel NVMe driver unbound, and vfio-pci bound to the device:
HUGEMEM=4096 xnvme-driver
The command above should produce output similar to:
0000:03:00.0 (1b36 0010): nvme -> vfio-pci
0000:00:02.0 (1af4 1001): Active mountpoints on /dev/vda, so not binding
To unbind from vfio-pci and rebind the kernel NVMe driver, run:
xnvme-driver reset
Which should produce output similar to:
0000:03:00.0 (1b36 0010): vfio-pci -> nvme
0000:00:02.0 (1af4 1001): Already using the virtio-pci driver
Device Identifiers#
Since the kernel NVMe driver is unbound from the device, the kernel no longer knows that the PCIe device is an NVMe device; thus, it no longer lives in the Linux devfs, that is, it is no longer available in /dev as e.g. /dev/nvme0n1.
Instead of a filepath in devfs, you use PCI ids and xNVMe options.
As always, use the xnvme cli tool to enumerate devices:
xnvme enum
xnvme_enumeration:
  - {uri: '0000:03:00.0', dtype: 0x2, nsid: 0x1, csi: 0x0}
  - {uri: '0000:03:00.0', dtype: 0x2, nsid: 0x2, csi: 0x2}
Notice the multiple entries using the same PCI id but with different nsid values. xNVMe options, such as the --dev-nsid argument below or a URI query parameter, are provided as a means to tell xNVMe that you want to use the NVMe controller at 0000:03:00.0 and the namespace identified by nsid=1.
xnvme-driver
xnvme info 0000:03:00.0 --dev-nsid=1
0000:03:00.0 (1b36 0010): Already using the vfio-pci driver
0000:00:02.0 (1af4 1001): Active mountpoints on /dev/vda, so not binding
xnvme_dev:
  xnvme_ident:
    uri: '0000:03:00.0'
    dtype: 0x2
    nsid: 0x1
    csi: 0x0
  xnvme_be:
    admin: {id: 'nvme'}
    sync: {id: 'nvme'}
    async: {id: 'nvme'}
    attr: {name: 'spdk'}
  xnvme_opts:
    be: 'spdk'
    mem: 'FIX-ID-VS-MIXIN-NAME'
    dev: 'FIX-ID-VS-MIXIN-NAME'
    admin: 'nvme'
    sync: 'nvme'
    async: 'nvme'
    oflags: 0x4
  xnvme_geo:
    type: XNVME_GEO_CONVENTIONAL
    npugrp: 1
    npunit: 1
    nzone: 1
    nsect: 2097152
    nbytes: 4096
    nbytes_oob: 0
    tbytes: 8589934592
    mdts_nbytes: 524288
    lba_nbytes: 4096
    lba_extended: 0
    ssw: 12
Similarly, when using the API, you would use these URIs instead of filepaths:
...
struct xnvme_dev *dev = xnvme_dev_open("pci:0000:01:00.0?nsid=1");
if (!dev) {
	// handle the error
}
...
xnvme_dev_close(dev);
Windows Kernel#
Windows 10 or later is currently preferred as it has all the features which xNVMe utilizes. This section also gives you a brief overview of the different I/O paths and APIs which the xNVMe API unifies access to.
NVMe Driver and IOCTLs#
The default for xNVMe is to communicate with devices via the operating system NVMe driver IOCTLs, specifically on Windows the following are used:
IOCTL_STORAGE_QUERY_PROPERTY
IOCTL_STORAGE_SET_PROPERTY
IOCTL_STORAGE_REINITIALIZE_MEDIA
IOCTL_SCSI_PASS_THROUGH_DIRECT
You can check that this interface is behaving as expected by running:
xnvme.exe info \\.\PhysicalDrive0
Which should yield output equivalent to:
xnvme_dev:
  xnvme_ident:
    uri: '\\.\PhysicalDrive0'
    dtype: 0x2
    nsid: 0x1
    csi: 0x0
    subnqn: 'nqn.1994-11.com.samsung:nvme:980M.2:S649NL0T973010L '
  xnvme_be:
    admin: {id: 'nvme'}
    sync: {id: 'nvme'}
    async: {id: 'iocp'}
    attr: {name: 'windows'}
This tells you that xNVMe can communicate with the given device identifier, that it utilizes NVMe IOCTLs for synchronous command execution, and that it uses iocp for asynchronous command execution. This method can be used for raw devices via the \\.\PhysicalDrive<disk number> device path.
The commands listed below are currently supported by xNVMe via the IOCTL path:
Admin Commands
Get Log Page
Identify
Get Feature
Format NVM
I/O Commands
Read
Write
NVMe Driver and Regular File#
xNVMe can communicate with file-system-mounted devices via generic operating system APIs such as the ReadFile and WriteFile operations. This method can be used to operate on regular files.
You can check that this interface is behaving as expected by running:
xnvme.exe info C:\README.md
Which should yield output equivalent to:
xnvme_dev:
  xnvme_ident:
    uri: 'C:\README.md'
    dtype: 0x4
    nsid: 0x1
    csi: 0x1f
    subnqn: ''
  xnvme_be:
    admin: {id: 'file'}
    sync: {id: 'file'}
    async: {id: 'iocp'}
    attr: {name: 'windows'}
This tells you that xNVMe can communicate with the given regular file, that it utilizes the file backend for synchronous command execution, and that it uses iocp for asynchronous command execution. This method can be used for file operations via a <drive letter>:\<file name> path.
Async I/O via iocp#
When asynchronous I/O is available, the NVMe NVM commands for read and write are sent over the Windows IOCP interface. Doing so improves command throughput at higher queue depths when compared to sending the commands via the NVMe driver IOCTLs.
One can explicitly tell xNVMe to utilize iocp for async I/O by encoding it in the device identifier, like so:
xnvme_io_async read \\.\PhysicalDrive0 --slba 0x0 --qdepth 1 --async iocp
Yielding the output:
# Allocating and filling buf of nbytes: 512
# Initializing queue and setting default callback function and arguments
# Read uri: '\\.\PhysicalDrive0', qd: 1
xnvme_lba_range:
  slba: 0x0000000000000000
  elba: 0x0000000000000000
  naddrs: 1
  nbytes: 512
  attr: { is_zones: 0, is_valid: 1}
wall-clock: {elapsed: 0.0002, mib: 0.00, mib_sec: 2.08}
# cb_args: {submitted: 1, completed: 1, ecount: 0}
Async I/O via iocp_th#
Similar to the iocp interface; the only difference is that a separate poller thread is used to fetch the completed I/Os.
One can explicitly tell xNVMe to utilize iocp_th for async I/O by encoding it in the device identifier, like so:
xnvme_io_async read \\.\PhysicalDrive0 --slba 0x0 --qdepth 1 --async iocp_th
Yielding the output:
# Allocating and filling buf of nbytes: 512
# Initializing queue and setting default callback function and arguments
# Read uri: '\\.\PhysicalDrive0', qd: 1
xnvme_lba_range:
  slba: 0x0000000000000000
  elba: 0x0000000000000000
  naddrs: 1
  nbytes: 512
  attr: { is_zones: 0, is_valid: 1}
wall-clock: {elapsed: 0.0002, mib: 0.00, mib_sec: 2.14}
# cb_args: {submitted: 1, completed: 1, ecount: 0}
Async I/O via io_ring#
xNVMe utilizes the Windows io_ring interface and its support for feature-probing. When available, xNVMe can send io_ring-specific requests using the IORING_HANDLE_REF and IORING_BUFFER_REF structures for read and write via the Windows io_ring interface. Doing so improves command throughput at all I/O depths when compared to sending the commands via NVMe driver IOCTLs.
One can explicitly tell xNVMe to utilize io_ring for async I/O by encoding it in the device identifier, like so:
xnvme_io_async read \\.\PhysicalDrive0 --slba 0x0 --qdepth 1 --async io_ring
Yielding the output:
# Allocating and filling buf of nbytes: 512
# Initializing queue and setting default callback function and arguments
# Read uri: '\\.\PhysicalDrive0', qd: 1
xnvme_lba_range:
  slba: 0x0000000000000000
  elba: 0x0000000000000000
  naddrs: 1
  nbytes: 512
  attr: { is_zones: 0, is_valid: 1}
wall-clock: {elapsed: 0.0003, mib: 0.00, mib_sec: 1.92}
# cb_args: {submitted: 1, completed: 1, ecount: 0}