# uPCIe CUDA
The upcie-cuda backend enables direct PCIe peer-to-peer (P2P) data transfers between the NVMe device and the GPU, with I/O buffers allocated from a heap backed by CUDA device memory.
Physical addresses for GPU memory are resolved once at initialization through
the Linux dma-buf interface. The CUDA driver exports memory at 64 KiB
granularity (one dma-buf page per 64 KiB of GPU memory), and these physical
addresses are stored in a lookup table (LUT) indexed by GPU page. At
submission time, PRP entries are built at 4 KiB granularity by computing
sub-page offsets within each 64 KiB LUT entry. This matches the host page
size used by NVMe for PRP construction and sector alignment.
## Memory Architecture
This backend uses a hybrid memory model:
| Structure | Location | Reason |
|---|---|---|
| Data buffers | GPU device memory (CUDA heap, 1 GiB) | Transferred directly by the NVMe controller via PCIe P2P, bypassing host DRAM |
| SQ, CQ, PRP lists | Host hugepage memory (host heap, 256 MiB) | The CPU writes and the NVMe controller DMA-reads these structures; host-accessible memory is required |
The NVMe data path goes GPU ↔ NVMe without touching host DRAM. The control path (submission queue entries, completion queue entries, PRP lists) still flows through host memory. As a result, both the CUDA heap and the host hugepage runtime are initialized when the first upcie-cuda device is opened.
## Kernel Requirement
Physical address resolution for CUDA device memory relies on importing a
dma-buf exported by the CUDA driver into the kernel’s DMA subsystem. This
requires a Linux kernel patch that is not yet upstream. Build and install the
patched kernel before using the upcie-cuda backend.
## System Configuration
### Hardware Requirements
P2P transfers require a GPU with a sufficiently large BAR1 window. BAR1 maps GPU device memory into the host PCIe address space and must be large enough to cover the CUDA heap (1 GiB by default). To check the available BAR1 size:
```
nvidia-smi -q -d memory
```
Look for the BAR1 Memory Usage section. If Total BAR1 is smaller than the
heap size, initialization will fail.
### Hugepages
In addition to the CUDA heap, opening an upcie-cuda device also initializes the host hugepage runtime (256 MiB) used to hold NVMe control structures. Follow the hugepage setup steps in uPCIe (host memory) before opening an upcie-cuda device.
## Limitations
- **GPU 0 only.** The CUDA context and heap are always created on CUDA device 0. Multiple GPU support is not implemented.
- **1 GiB heap.** The CUDA heap is fixed at 1 GiB. Allocations beyond this limit return `ENOMEM`.
- **No PRP list chaining.** Each request has a single 4 KiB PRP list page (512 entries). Combined with PRP1, the maximum transfer size per command is 513 × 4 KiB ≈ 2 MiB.
- **No `buf_realloc`.** Buffer reallocation is not implemented and returns `ENOSYS`.
- **No memory mapping (`mem_map`/`mem_unmap`).** These return `ENOSYS`.
- **No pseudo commands.** Show registers, controller reset, subsystem reset, and namespace rescan all return `ENOSYS`.