xnvmeperf#
xnvmeperf is a multi-threaded async I/O benchmark for NVMe devices. It
runs a time-bounded async I/O loop and reports per-device and aggregate IOPS
and throughput. A verify subcommand checks data integrity by writing a known
per-LBA pattern and reading it back.
When built with CUDA support (XNVME_BE_UPCIE_CUDA_ENABLED), two additional
subcommands are available: cuda-run and cuda-verify. These run NVMe I/O
directly from CUDA kernels via the uPCIe CUDA backend, bypassing
the host I/O path entirely.
Usage: xnvmeperf <command> [<args>]
Where <command> is one of:
run | Run a benchmark against the given devices
verify | Verify data integrity by writing and reading back a known pattern
cuda-run | Run a GPU benchmark against the given devices (requires upcie-cuda backend)
cuda-verify | Verify GPU NVMe I/O data integrity (requires upcie-cuda backend)
See 'xnvmeperf <command> --help' for the description of [<args>]
xnvmeperf - NVMe async IO benchmark -- ver: {major: 0, minor: 7, patch: 5}
run — Benchmark#
Runs async I/O against one or more NVMe devices for a fixed duration. Devices
are listed as trailing positional arguments. One thread is spawned per CPU in
--cpumask, e.g. 0x3 spawns two threads on CPUs 0 and 1. If there are
more devices than CPUs, devices are split evenly so each thread owns a slice and
runs one async job per owned device. If there are more CPUs than devices,
devices are assigned round-robin so multiple threads drive the same device
concurrently.
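The distribution policy above can be sketched in Python. This is an illustration of the described rules, not the actual xnvmeperf source; function names are hypothetical:

```python
def cpus_from_mask(cpumask: int) -> list[int]:
    """Expand a hex CPU bitmask into CPU indices, e.g. 0x3 -> [0, 1]."""
    return [cpu for cpu in range(cpumask.bit_length()) if cpumask & (1 << cpu)]

def assign(devices: list[str], cpus: list[int]) -> dict[int, list[str]]:
    """Map each CPU/thread to the devices it drives."""
    if len(devices) >= len(cpus):
        # More devices than CPUs: split devices evenly, each thread
        # owns a contiguous slice and runs one async job per device.
        base, extra = divmod(len(devices), len(cpus))
        out, start = {}, 0
        for t, cpu in enumerate(cpus):
            end = start + base + (1 if t < extra else 0)
            out[cpu] = devices[start:end]
            start = end
        return out
    # More CPUs than devices: round-robin, so multiple threads
    # drive the same device concurrently.
    return {cpu: [devices[i % len(devices)]] for i, cpu in enumerate(cpus)}
```

For example, `--cpumask 0x3` with three devices gives thread 0 two devices and thread 1 one; with a single device, both threads drive it.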
Usage: xnvmeperf run [<uri>...] [<args>]
Run a time-bounded async IO benchmark against one or more NVMe devices.
Devices are distributed across CPU threads pinned by --cpumask.
Positional arguments:
[uri ...] ; Device URI e.g. '/dev/nvme0n1', '0000:01:00.1', '10.9.8.1.8888', '\\.\PhysicalDrive1'
Where <args> include:
--iopattern STRING ; IO pattern (read, write, randread, randwrite)
[ --nqueues NUM ] ; Number of queues per device
--qdepth NUM ; Use given 'NUM' as queue max capacity
--iosize NUM ; Use given 'NUM' as bs/iosize
--runtime NUM ; Run for 'NUM' seconds
--cpumask STRING ; Hex CPU bitmask for thread pinning (e.g. 0x3)
With <args> for backend:
[ --be STRING ] ; xNVMe backend, e.g. 'linux', 'spdk', 'fbsd', 'macos', 'posix', 'windows'
[ --direct ] ; Bypass layers
[ --poll_io NUM ] ; For async=io_uring, enable hipri/io-compl.polling
[ --poll_sq NUM ] ; For async=io_uring, enable kernel-side sqthread-poll
[ --help ] ; Show usage / help
See 'xnvmeperf --help' for other commands
Example — sequential read on a single device:
xnvmeperf run --iopattern read --qdepth 32 --iosize 4096 \
--runtime 10 --cpumask 0x1 /dev/nvme0n1
Example — random write across two devices on two CPUs:
xnvmeperf run --iopattern randwrite --qdepth 64 --iosize 4096 \
--runtime 10 --cpumask 0x3 /dev/nvme0n1 /dev/nvme1n1
verify — Data integrity check#
Writes --count sequential I/Os with a per-LBA stamp pattern, then reads
them back and compares against the expected data. Reports the number of
mismatches and I/O errors per device.
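The write-stamp-then-verify scheme can be sketched as follows. The exact stamp layout is internal to xnvmeperf and not documented here; as a stand-in, this sketch assumes each 4096-byte block is filled with its LBA as a repeated little-endian 64-bit integer:

```python
import struct

LBA_SIZE = 4096  # assumed block size for illustration

def stamp(lba: int) -> bytes:
    """Hypothetical per-LBA stamp: the LBA as a little-endian u64,
    repeated to fill the block."""
    return struct.pack("<Q", lba) * (LBA_SIZE // 8)

def check(lba: int, data: bytes) -> bool:
    """Compare read-back data against the expected stamp for this LBA."""
    return data == stamp(lba)
```

Because the stamp encodes the LBA itself, a misdirected write or read (data landing at, or coming from, the wrong LBA) is detected as a mismatch, not just bit corruption.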
Usage: xnvmeperf verify [<uri>...] [<args>]
For each device: writes --count sequential IOs with a per-LBA pattern,
then reads them back and compares against the expected data.
Positional arguments:
[uri ...] ; Device URI e.g. '/dev/nvme0n1', '0000:01:00.1', '10.9.8.1.8888', '\\.\PhysicalDrive1'
Where <args> include:
--iosize NUM ; Use given 'NUM' as bs/iosize
--count NUM ; Use given 'NUM' as count
With <args> for backend:
[ --be STRING ] ; xNVMe backend, e.g. 'linux', 'spdk', 'fbsd', 'macos', 'posix', 'windows'
[ --direct ] ; Bypass layers
[ --poll_io NUM ] ; For async=io_uring, enable hipri/io-compl.polling
[ --poll_sq NUM ] ; For async=io_uring, enable kernel-side sqthread-poll
[ --help ] ; Show usage / help
Example:
xnvmeperf verify --iosize 4096 --count 256 /dev/nvme0n1
cuda-run — GPU benchmark#
See also:
- uPCIe CUDA: backend setup, system configuration, and memory architecture.
- GPU-Resident Queue API: the libxnvme_cuda API used to create queues and dispatch commands from CUDA kernels.
Requires the upcie-cuda backend. All queues across all devices are driven
by a single CUDA kernel: each CUDA block owns one NVMe queue and each thread
within the block owns one queue slot, so --qdepth threads submit and reap
commands in lock-step. The grid has ndevs × --nqueues blocks in total.
Both --qdepth and --iosize must be powers of 2. Supported patterns are
read, write, randread, and randwrite.
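The grid geometry described above can be sketched as a small helper (illustrative only; the function name and error handling are hypothetical):

```python
def grid_dims(ndevs: int, nqueues: int, qdepth: int, iosize: int):
    """Return (blocks, threads-per-block) for the single CUDA kernel:
    one block per NVMe queue, one thread per queue slot."""
    # Both --qdepth and --iosize must be powers of 2.
    for name, val in (("qdepth", qdepth), ("iosize", iosize)):
        if val <= 0 or val & (val - 1):
            raise ValueError(f"{name} must be a power of 2")
    nblocks = ndevs * nqueues  # every queue on every device gets a block
    nthreads = qdepth          # one thread per queue slot, in lock-step
    return nblocks, nthreads
```

For instance, two devices with four queues each at depth 32 launch an 8-block kernel with 32 threads per block.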
Usage: xnvmeperf cuda-run [<uri>...] [<args>]
Run a time-bounded GPU NVMe I/O benchmark against one or more devices.
All devices run in parallel: one CUDA block per queue, all queues launched
in a single kernel. All devices must use the same LBA size.
Positional arguments:
[uri ...] ; Device URI e.g. '/dev/nvme0n1', '0000:01:00.1', '10.9.8.1.8888', '\\.\PhysicalDrive1'
Where <args> include:
--iopattern STRING ; IO pattern (read, write, randread, randwrite)
[ --nqueues NUM ] ; Number of queues per device
--qdepth NUM ; Use given 'NUM' as queue max capacity
--iosize NUM ; Use given 'NUM' as bs/iosize
--runtime NUM ; Run for 'NUM' seconds
With <args> for backend:
[ --be STRING ] ; xNVMe backend, e.g. 'linux', 'spdk', 'fbsd', 'macos', 'posix', 'windows'
[ --help ] ; Show usage / help
Example — sequential read, four queues of depth 32 on two devices:
xnvmeperf cuda-run --iopattern read --nqueues 4 --qdepth 32 --iosize 4096 \
--runtime 10 --be upcie-cuda 0000:01:00.0 0000:02:00.0
Example — random write, single queue:
xnvmeperf cuda-run --iopattern randwrite --qdepth 64 --iosize 4096 \
--runtime 10 --be upcie-cuda 0000:01:00.0
cuda-verify — GPU data integrity check#
Uses the same queue topology as cuda-run. For each queue slot the kernel
writes a unique LBA-stamped pattern, then reads it back on the host and
compares against the expected data. This is intended to confirm that
cuda-run results reflect correct I/O rather than silent data corruption.
Usage: xnvmeperf cuda-verify [<uri>...] [<args>]
Write an LBA-stamped pattern to each device through GPU queues and read
it back, verifying that the data matches. Uses the same queue topology
as cuda-run so results are directly comparable.
Positional arguments:
[uri ...] ; Device URI e.g. '/dev/nvme0n1', '0000:01:00.1', '10.9.8.1.8888', '\\.\PhysicalDrive1'
Where <args> include:
--iosize NUM ; Use given 'NUM' as bs/iosize
[ --nqueues NUM ] ; Number of queues per device
--qdepth NUM ; Use given 'NUM' as queue max capacity
With <args> for backend:
[ --be STRING ] ; xNVMe backend, e.g. 'linux', 'spdk', 'fbsd', 'macos', 'posix', 'windows'
[ --help ] ; Show usage / help
Example:
xnvmeperf cuda-verify --iosize 4096 --qdepth 32 --be upcie-cuda 0000:01:00.0