# uPCIe CUDA
The upcie-cuda backend enables direct PCIe peer-to-peer (P2P) data transfers between the NVMe device and the GPU, with I/O buffers allocated from a heap backed by CUDA device memory.
Physical addresses for GPU memory are resolved once at initialization through
the Linux dma-buf interface. The CUDA driver exports memory at 64 KiB
granularity (one dma-buf page per 64 KiB of GPU memory), and these physical
addresses are stored in a lookup table (LUT) indexed by GPU page. At
submission time, PRP entries are built at 4 KiB granularity by computing
sub-page offsets within each 64 KiB LUT entry. This matches the host page
size used by NVMe for PRP construction and sector alignment.
## Memory Architecture
This backend uses a hybrid memory model:
| Structure | Location | Reason |
|---|---|---|
| Data buffers | GPU device memory (CUDA heap, 1 GiB) | Transferred directly by the NVMe controller via PCIe P2P, bypassing host DRAM |
| SQ, CQ, PRP lists | Host hugepage memory (host heap, 256 MiB) | The CPU writes and the NVMe controller DMA-reads these structures; host-accessible memory is required |
The NVMe data path goes GPU ↔ NVMe without touching host DRAM. The control path (submission queue entries, completion queue entries, PRP lists) still flows through host memory. As a result, both the CUDA heap and the host hugepage runtime are initialized when the first upcie-cuda device is opened.
## Kernel Requirement
Physical address resolution for CUDA device memory relies on importing a
dma-buf exported by the CUDA driver into the kernel’s DMA subsystem. This
requires a Linux kernel patch that is not yet upstream. Build and install the
patched kernel before using the upcie-cuda backend.
## System Configuration
### Hardware Requirements
P2P transfers require a GPU with a sufficiently large BAR1 window. BAR1 maps GPU device memory into the host PCIe address space and must be large enough to cover the CUDA heap (1 GiB by default). To check the available BAR1 size:
```
nvidia-smi -q -d memory
```
Look for the BAR1 Memory Usage section. If Total BAR1 is smaller than the
heap size, initialization will fail.
### Hugepages
In addition to the CUDA heap, opening an upcie-cuda device also initializes the host hugepage runtime (256 MiB) used to hold NVMe control structures. Follow the hugepage setup steps in uPCIe (host memory) before opening an upcie-cuda device.
## Limitations
- **GPU 0 only.** The CUDA context and heap are always created on CUDA device 0. Multiple GPU support is not implemented.
- **1 GiB heap.** The CUDA heap is fixed at 1 GiB. Allocations beyond this limit return `ENOMEM`.
- **No PRP list chaining.** Each request has a single 4 KiB PRP list page (512 entries). Combined with PRP1, the maximum transfer size per command is 513 × 4 KiB ≈ 2 MiB.
- **No `buf_realloc`.** Buffer reallocation is not implemented and returns `ENOSYS`.
- **No memory mapping (`mem_map`/`mem_unmap`).** These return `ENOSYS`.
- **No pseudo commands.** Show registers, controller reset, subsystem reset, and namespace rescan all return `ENOSYS`.