xnvmeperf#

xnvmeperf is a multi-threaded async I/O benchmark for NVMe devices. It runs a time-bounded async I/O loop and reports per-device and aggregate IOPS and throughput. A verify subcommand checks data integrity by writing a known per-LBA pattern and reading it back.

When built with CUDA support (XNVME_BE_UPCIE_CUDA_ENABLED), two additional subcommands are available: cuda-run and cuda-verify. These run NVMe I/O directly from CUDA kernels via the uPCIe CUDA backend, bypassing the host I/O path entirely.

Usage: xnvmeperf <command> [<args>]

Where <command> is one of:

  run              | Run a benchmark against the given devices
  verify           | Verify data integrity by writing and reading back a known pattern
  cuda-run         | Run a GPU benchmark against the given devices (requires upcie-cuda backend)
  cuda-verify      | Verify GPU NVMe I/O data integrity (requires upcie-cuda backend)

See 'xnvmeperf <command> --help' for the description of [<args>]

xnvmeperf - NVMe async IO benchmark -- ver: {major: 0, minor: 7, patch: 5}

run — Benchmark#

Runs async I/O against one or more NVMe devices for a fixed duration. Devices are listed as trailing positional arguments. One thread is spawned per CPU in --cpumask, e.g. 0x3 spawns two threads on CPUs 0 and 1. If there are more devices than CPUs, devices are split evenly so each thread owns a slice and runs one async job per owned device. If there are more CPUs than devices, devices are assigned round-robin so multiple threads drive the same device concurrently.
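
The distribution rule can be sketched in C; `devices_for_thread` below is a hypothetical helper for illustration, not part of the xnvmeperf source:

```c
/* Sketch of the device-distribution rule described above; this helper is
 * illustrative and not part of xnvmeperf. Thread 't' of 'nthreads' learns
 * which slice of the 'ndevs' devices it drives. */
static void
devices_for_thread(int t, int nthreads, int ndevs, int *first, int *count)
{
	if (ndevs >= nthreads) {
		/* More devices than CPUs: split evenly, one contiguous slice
		 * per thread, remainder spread over the first threads. */
		int base = ndevs / nthreads, rem = ndevs % nthreads;

		*first = t * base + ((t < rem) ? t : rem);
		*count = base + ((t < rem) ? 1 : 0);
	} else {
		/* More CPUs than devices: assign round-robin, so several
		 * threads drive the same device concurrently. */
		*first = t % ndevs;
		*count = 1;
	}
}
```

For example, with --cpumask 0x3 and four devices, thread 0 owns devices 0-1 and thread 1 owns devices 2-3; with --cpumask 0xF and two devices, threads 0 and 2 share device 0 while threads 1 and 3 share device 1.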

Usage: xnvmeperf run [<uri>...] [<args>]

Run a time-bounded async IO benchmark against one or more NVMe devices.
Devices are distributed across CPU threads pinned by --cpumask.
  
Positional arguments:

  [uri ...]                     ; Device URI e.g. '/dev/nvme0n1', '0000:01:00.1', '10.9.8.1:8888', '\\.\PhysicalDrive1'
  
Where <args> include:

  --iopattern STRING            ; IO pattern (read, write, randread, randwrite)
  [ --nqueues NUM ]             ; Number of queues per device
  --qdepth NUM                  ; Use given 'NUM' as queue max capacity
  --iosize NUM                  ; Use given 'NUM' as bs/iosize
  --runtime NUM                 ; Run for 'NUM' seconds
  --cpumask STRING              ; Hex CPU bitmask for thread pinning (e.g. 0x3)
  
With <args> for backend:

  [ --be STRING ]               ; xNVMe backend, e.g. 'linux', 'spdk', 'fbsd', 'macos', 'posix', 'windows'
  [ --direct ]                  ; Bypass layers
  [ --poll_io NUM ]             ; For async=io_uring, enable hipri/io-compl.polling
  [ --poll_sq NUM ]             ; For async=io_uring, enable kernel-side sqthread-poll
  [ --help ]                    ; Show usage / help

See 'xnvmeperf --help' for other commands

Example — sequential read on a single device:

xnvmeperf run --iopattern read --qdepth 32 --iosize 4096 \
    --runtime 10 --cpumask 0x1 /dev/nvme0n1

Example — random write across two devices on two CPUs:

xnvmeperf run --iopattern randwrite --qdepth 64 --iosize 4096 \
    --runtime 10 --cpumask 0x3 /dev/nvme0n1 /dev/nvme1n1
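
The --cpumask argument is a hex bitmask in which each set bit pins one benchmark thread to that CPU, so the thread count equals the mask's popcount. A minimal sketch of that interpretation (hypothetical helper, not xnvmeperf code):

```c
#include <stdlib.h>

/* Count the set bits in a hex cpumask string such as "0x3"; each set bit
 * corresponds to one pinned benchmark thread. Illustrative sketch only. */
static int
cpumask_nthreads(const char *mask_str)
{
	unsigned long mask = strtoul(mask_str, NULL, 16);
	int nthreads = 0;

	for (; mask; mask >>= 1) {
		nthreads += (int)(mask & 1);
	}
	return nthreads;
}
```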

verify — Data integrity check#

Writes --count sequential I/Os with a per-LBA stamp pattern, then reads them back and compares against the expected data. Reports the number of mismatches and I/O errors per device.
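
The per-LBA stamp could work along the following lines; the actual byte pattern xnvmeperf writes is not specified here, so the 64-bit stamp below is an assumption for illustration:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Fill one I/O buffer with a per-LBA stamp: every 64-bit word carries the
 * LBA it belongs to, so a misplaced or dropped write is detectable on
 * read-back. Illustrative sketch, not xnvmeperf's actual on-disk pattern. */
static void
stamp_fill(uint8_t *buf, size_t iosize, uint64_t lba)
{
	uint64_t stamp = lba;

	for (size_t off = 0; off < iosize; off += sizeof(stamp)) {
		memcpy(buf + off, &stamp, sizeof(stamp));
	}
}

/* Return the number of mismatching 64-bit words after read-back. */
static size_t
stamp_check(const uint8_t *buf, size_t iosize, uint64_t lba)
{
	uint64_t expected = lba, got;
	size_t nerr = 0;

	for (size_t off = 0; off < iosize; off += sizeof(got)) {
		memcpy(&got, buf + off, sizeof(got));
		if (got != expected) {
			nerr++;
		}
	}
	return nerr;
}
```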

Usage: xnvmeperf verify [<uri>...] [<args>]

For each device: writes --count sequential IOs with a per-LBA pattern,
then reads them back and compares against the expected data.
  
Positional arguments:

  [uri ...]                     ; Device URI e.g. '/dev/nvme0n1', '0000:01:00.1', '10.9.8.1:8888', '\\.\PhysicalDrive1'
  
Where <args> include:

  --iosize NUM                  ; Use given 'NUM' as bs/iosize
  --count NUM                   ; Use given 'NUM' as count
  
With <args> for backend:

  [ --be STRING ]               ; xNVMe backend, e.g. 'linux', 'spdk', 'fbsd', 'macos', 'posix', 'windows'
  [ --direct ]                  ; Bypass layers
  [ --poll_io NUM ]             ; For async=io_uring, enable hipri/io-compl.polling
  [ --poll_sq NUM ]             ; For async=io_uring, enable kernel-side sqthread-poll
  [ --help ]                    ; Show usage / help

Example:

xnvmeperf verify --iosize 4096 --count 256 /dev/nvme0n1

cuda-run — GPU benchmark#

See also

  uPCIe CUDA: backend setup, system configuration, and memory architecture.

  GPU-Resident Queue API: the libxnvme_cuda API used to create queues and dispatch commands from CUDA kernels.

Requires the upcie-cuda backend. All queues across all devices are driven by a single CUDA kernel: each CUDA block owns one NVMe queue and each thread within the block owns one queue slot, so --qdepth threads submit and reap commands in lock-step. The grid has ndevs × --nqueues blocks in total.

Both --qdepth and --iosize must be powers of 2. Supported patterns are read, write, randread, and randwrite.
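
The launch geometry implied above can be sketched as follows; the helper names are illustrative and not part of the libxnvme_cuda API:

```c
/* One CUDA block per NVMe queue, one thread per queue slot; the grid
 * covers all queues on all devices. Illustrative sketch only. */
static int
is_pow2(unsigned v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

static int
grid_geometry(unsigned ndevs, unsigned nqueues, unsigned qdepth,
	      unsigned iosize, unsigned *nblocks, unsigned *block_threads)
{
	if (!is_pow2(qdepth) || !is_pow2(iosize)) {
		return -1;	/* --qdepth and --iosize must be powers of 2 */
	}
	*nblocks = ndevs * nqueues;	/* one block per queue, all devices */
	*block_threads = qdepth;	/* one thread per queue slot */
	return 0;
}
```

For instance, two devices with four queues of depth 32 launch a grid of 8 blocks with 32 threads each.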

Usage: xnvmeperf cuda-run [<uri>...] [<args>]

Run a time-bounded GPU NVMe I/O benchmark against one or more devices.
All devices run in parallel: one CUDA block per queue, all queues launched
in a single kernel. All devices must use the same LBA size.
  
Positional arguments:

  [uri ...]                     ; Device URI e.g. '/dev/nvme0n1', '0000:01:00.1', '10.9.8.1:8888', '\\.\PhysicalDrive1'
  
Where <args> include:

  --iopattern STRING            ; IO pattern (read, write, randread, randwrite)
  [ --nqueues NUM ]             ; Number of queues per device
  --qdepth NUM                  ; Use given 'NUM' as queue max capacity
  --iosize NUM                  ; Use given 'NUM' as bs/iosize
  --runtime NUM                 ; Run for 'NUM' seconds
  
With <args> for backend:

  [ --be STRING ]               ; xNVMe backend, e.g. 'linux', 'spdk', 'fbsd', 'macos', 'posix', 'windows'
  [ --help ]                    ; Show usage / help

Example — sequential read, four queues of depth 32 on two devices:

xnvmeperf cuda-run --iopattern read --nqueues 4 --qdepth 32 --iosize 4096 \
    --runtime 10 --be upcie-cuda 0000:01:00.0 0000:02:00.0

Example — random write, single queue:

xnvmeperf cuda-run --iopattern randwrite --qdepth 64 --iosize 4096 \
    --runtime 10 --be upcie-cuda 0000:01:00.0

cuda-verify — GPU data integrity check#

Uses the same queue topology as cuda-run. For each queue slot the kernel writes a unique LBA-stamped pattern, then reads it back on the host and compares against the expected data. This is intended to confirm that cuda-run results reflect correct I/O rather than silent data corruption.

Usage: xnvmeperf cuda-verify [<uri>...] [<args>]

Write an LBA-stamped pattern to each device through GPU queues and read
it back, verifying that the data matches. Uses the same queue topology
as cuda-run so results are directly comparable.
  
Positional arguments:

  [uri ...]                     ; Device URI e.g. '/dev/nvme0n1', '0000:01:00.1', '10.9.8.1:8888', '\\.\PhysicalDrive1'
  
Where <args> include:

  --iosize NUM                  ; Use given 'NUM' as bs/iosize
  [ --nqueues NUM ]             ; Number of queues per device
  --qdepth NUM                  ; Use given 'NUM' as queue max capacity
  
With <args> for backend:

  [ --be STRING ]               ; xNVMe backend, e.g. 'linux', 'spdk', 'fbsd', 'macos', 'posix', 'windows'
  [ --help ]                    ; Show usage / help

Example:

xnvmeperf cuda-verify --iosize 4096 --qdepth 32 --be upcie-cuda 0000:01:00.0