Future Work#

The work presented here is bounded in two important ways. First, HOMI is a reference implementation under active development rather than a complete system: several architectural components described in Section Architecture are not yet realized in the current PoC. Second, the scope is limited to locally attached NVMe storage, leaving remote and disaggregated storage as an open direction. The following sections describe the most significant areas of future work along these and other dimensions.

Completing the HOMI Reference Implementation#

The most immediate area of future work is completing the HOMI reference implementation described in Section Architecture. The current PoC demonstrates key aspects of the design in a reduced form, but the full reference implementation requires several components not yet in place: a persistent host-resident control-plane daemon, dynamic provisioning and reassignment of NVMe queue resources across initiators, and coordinated lifecycle management spanning the OS-managed, user-space-managed, and device-initiated I/O paths. Dynamic queue management is of particular importance, as it is a prerequisite for supporting workloads where the set of active initiators changes at runtime. The current PoC relies exclusively on SR-IOV for hardware-assisted queue isolation, a feature limited to datacenter-grade NVMe devices. Completing the reference implementation also includes realizing the ublk-based software-mediated multipath configuration described in Section Architecture, which removes this hardware dependency and enables the architecture to operate on commodity storage hardware.
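As an illustration of the kind of state the control plane must maintain for dynamic queue management, the sketch below shows a minimal queue-assignment table in C. All type and function names are hypothetical and do not reflect the actual HOMI implementation; the point is only that queue pairs become resources handed out to, and reclaimed from, OS, user-space, and device-resident initiators at runtime.

```c
/*
 * Hypothetical control-plane bookkeeping for dynamic NVMe queue assignment.
 * All names are illustrative; they do not correspond to the HOMI code base.
 */
#include <stdint.h>
#include <string.h>

enum homi_initiator_kind {
    HOMI_INITIATOR_OS,          /* kernel block-layer path */
    HOMI_INITIATOR_USERSPACE,   /* user-space driver path */
    HOMI_INITIATOR_DEVICE,      /* accelerator-resident driver path */
};

struct homi_queue_slot {
    uint16_t qid;                       /* NVMe I/O queue-pair identifier */
    int      in_use;
    enum homi_initiator_kind owner;
    uint64_t initiator_id;              /* opaque handle for the owner */
};

#define HOMI_MAX_QUEUES 64

static struct homi_queue_slot queue_table[HOMI_MAX_QUEUES];

/* Hand a free queue pair to an initiator; returns the qid, or -1 if none left. */
static int homi_assign_queue(enum homi_initiator_kind kind, uint64_t initiator_id)
{
    for (int i = 0; i < HOMI_MAX_QUEUES; i++) {
        if (!queue_table[i].in_use) {
            queue_table[i] = (struct homi_queue_slot){
                .qid = (uint16_t)i,
                .in_use = 1,
                .owner = kind,
                .initiator_id = initiator_id,
            };
            return i;
        }
    }
    return -1;
}

/* Reclaim a queue pair when its initiator goes away, making it reassignable. */
static void homi_release_queue(int qid)
{
    if (qid >= 0 && qid < HOMI_MAX_QUEUES)
        memset(&queue_table[qid], 0, sizeof(queue_table[qid]));
}
```

In the full implementation, assignment and release of this kind would be driven by the control-plane daemon as initiators attach and detach at runtime.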

Kernel Integration and Upstream Components#

The udmabuf-import patch, which extends the udmabuf driver to import arbitrary dma-buf file descriptors and expose physical address mappings to user space, is currently maintained as an out-of-tree kernel patch. Upstreaming this interface, or contributing an equivalent mechanism through a suitable kernel subsystem, would remove the requirement for a custom kernel build and allow the user-space P2P path to be exercised on unmodified production systems.
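For context, the sketch below shows the existing upstream udmabuf interface, which creates a dma-buf from memfd-backed pages; the udmabuf-import patch extends this character device in the opposite direction by importing an externally created dma-buf. The import ioctl itself is specific to the out-of-tree patch and is therefore not shown.

```c
/*
 * Existing upstream udmabuf usage: wrap memfd-backed pages in a dma-buf.
 * The out-of-tree udmabuf-import patch extends this character device with
 * an import path for externally created dma-bufs (not shown here).
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

int main(void)
{
    size_t size = 1 << 20;                          /* 1 MiB, page-aligned */

    int memfd = memfd_create("payload", MFD_ALLOW_SEALING);
    ftruncate(memfd, size);
    fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);       /* required by udmabuf */

    struct udmabuf_create create = {
        .memfd  = memfd,
        .flags  = UDMABUF_FLAGS_CLOEXEC,
        .offset = 0,
        .size   = size,
    };

    int devfd = open("/dev/udmabuf", O_RDWR);
    int dmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create);
    if (dmabuf_fd < 0)
        perror("UDMABUF_CREATE");
    else
        printf("created dma-buf fd %d\n", dmabuf_fd);

    close(devfd);
    close(memfd);
    return 0;
}
```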

The Linux kernel’s io_uring and dma-buf integration for CPU-initiated P2P I/O is under active development in mainline. As this path stabilizes, a direct comparison between the kernel-managed and user-space-managed P2P architectures described in Section Architecture becomes possible on identical hardware, an evaluation that would clarify the performance and operational trade-offs between the two approaches.
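As a point of reference for such a comparison, the sketch below shows the existing liburing fixed-buffer flow against an ordinary host buffer. The in-development dma-buf integration is expected to attach to this registered-buffer machinery, but its final interface is still evolving, so no dma-buf-specific calls are shown; the device path is a placeholder.

```c
/*
 * Existing liburing fixed-buffer read against a host buffer.  The mainline
 * dma-buf integration is expected to attach to this registered-buffer path;
 * its interface is still in flux, so only the current flow is shown.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    void *buf = NULL;
    posix_memalign(&buf, 4096, 4096);
    struct iovec iov = { .iov_base = buf, .iov_len = 4096 };
    io_uring_register_buffers(&ring, &iov, 1);      /* pre-register the buffer */

    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);   /* placeholder device */

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, fd, buf, 4096, 0, 0 /* registered index */);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```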

Broader Accelerator Support#

The current PoC is developed and validated against NVIDIA GPUs using CUDA for device memory allocation and dma-buf export. The I/O path itself — built on xNVMe, uPCIe, and dma-buf — is not NVIDIA-specific, as these components operate on any dma-buf exporter. The CUDA dependency is therefore confined to the memory management layer. Extending support to AMD GPUs via ROCm requires work in three areas. First, device memory allocation must be ported from cuMemAlloc to the HIP equivalent. Second, dma-buf export must be adapted from cuMemGetHandleForAddressRange to the corresponding amdgpu kernel driver interface. Third, the device-resident NVMe driver must be ported from CUDA to HIP; this is the most substantial effort.
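For reference, the first two of these areas reduce today to two CUDA driver-API calls; the sketch below shows that CUDA-side flow, assuming a driver and platform where dma-buf export of device memory is supported. A ROCm port would replace both calls with their HIP and amdgpu counterparts.

```c
/*
 * CUDA-side device memory allocation and dma-buf export, the two calls the
 * porting effort starts from.  Assumes a driver/platform where dma-buf
 * export of device memory is supported.
 */
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr dptr;
    size_t size = 1 << 20;                  /* 1 MiB of device memory */
    int dmabuf_fd = -1;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* Allocation: a ROCm port swaps this for the HIP allocator. */
    cuMemAlloc(&dptr, size);

    /* Export: hand the allocation to the rest of the stack as a dma-buf fd. */
    CUresult rc = cuMemGetHandleForAddressRange(
        &dmabuf_fd, dptr, size, CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD, 0);
    if (rc != CUDA_SUCCESS)
        fprintf(stderr, "dma-buf export failed: %d\n", (int)rc);
    else
        printf("exported dma-buf fd %d for %zu bytes\n", dmabuf_fd, size);

    cuMemFree(dptr);
    cuCtxDestroy(ctx);
    return 0;
}
```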

Multi-Accelerator Topologies#

While multi-accelerator support is a goal of this work, only single-accelerator configurations have been targeted so far. Dynamic queue management, discussed above, is a prerequisite, since each additional accelerator is a further initiator that must be provisioned with its own queue resources. Beyond that, multi-accelerator support requires accounting for PCIe topology effects on P2P transfer latency and bandwidth, and managing concurrent access to shared namespaces from multiple devices within the HOMI control plane.
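One low-cost way to account for topology is to compare the resolved sysfs paths of an accelerator and an NVMe device: the more path components they share, the deeper their common ancestor in the PCIe hierarchy (ideally a shared switch) and the shorter the P2P route. The sketch below illustrates this heuristic; the BDFs are placeholders and the metric is deliberately crude.

```c
/*
 * Crude PCIe-proximity check based on shared sysfs path components.
 * The BDFs are placeholders for an accelerator and an NVMe device.
 */
#define _GNU_SOURCE
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Count leading '/'-separated components that two resolved paths share. */
static int shared_components(const char *a, const char *b)
{
    int shared = 0;
    while (*a == '/' && *b == '/') {
        a++; b++;
        const char *ea = strchrnul(a, '/');
        const char *eb = strchrnul(b, '/');
        if (ea - a != eb - b || strncmp(a, b, (size_t)(ea - a)) != 0)
            break;
        shared++;
        a = ea;
        b = eb;
    }
    return shared;
}

static void resolve_pci(const char *bdf, char *out)
{
    char link[PATH_MAX];
    snprintf(link, sizeof(link), "/sys/bus/pci/devices/%s", bdf);
    if (!realpath(link, out)) {
        perror(link);
        exit(1);
    }
}

int main(void)
{
    char gpu_path[PATH_MAX], nvme_path[PATH_MAX];

    resolve_pci("0000:41:00.0", gpu_path);     /* placeholder GPU BDF */
    resolve_pci("0000:42:00.0", nvme_path);    /* placeholder NVMe BDF */

    printf("shared PCIe path components: %d\n",
           shared_components(gpu_path, nvme_path));
    return 0;
}
```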

Remote Storage and RDMA#

The current work is scoped to locally attached NVMe storage. Extending AiSIO to remote storage is an open direction, with two distinct approaches under consideration. The first is NVMe-oF, carrying NVMe commands over RDMA-capable transports such as RoCE or InfiniBand, which preserves the block-level access model of the locally attached case. The second is pNFS, which exposes distributed storage while preserving file system semantics at the protocol level. In both cases, the goal is to maintain the core properties of the AiSIO architecture, namely P2P data movement directly into accelerator memory and device-initiated I/O, while operating against remote targets.

Evaluating Device-Initiated Paths#

Initial benchmarking of device-initiated I/O is in place: xnvmeperf’s cuda-run subcommand drives NVMe I/O entirely from CUDA kernels, and I/O-size and queue-depth scaling experiments are complete. The next step is integrating device-initiated I/O into FIL to evaluate performance with file-based workloads, where block translation through XAL and the full AiSIO stack are exercised end-to-end.
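To make concrete what driving NVMe I/O entirely from CUDA kernels involves, the sketch below shows a single device-side read submission against a submission queue and doorbell that the host has already mapped into device-visible memory. It is illustrative only and is not the xnvmeperf implementation: queue wrap-around, completion polling, and PRP-list construction for larger transfers are all omitted.

```cuda
/*
 * Illustrative device-initiated NVMe read: one thread builds a submission
 * queue entry and rings the doorbell.  The SQ, doorbell register, and data
 * buffer are assumed to have been mapped into device-visible memory by the
 * host.  Queue wrap-around, completions, and PRP lists are omitted.
 */
#include <cstdint>

struct nvme_sqe {               /* 64-byte NVMe submission queue entry */
    uint8_t  opc;               /* opcode: 0x02 = Read */
    uint8_t  flags;
    uint16_t cid;               /* command identifier */
    uint32_t nsid;              /* namespace identifier */
    uint64_t rsvd;              /* cdw2-3, unused for Read */
    uint64_t mptr;
    uint64_t prp1;              /* physical/P2P address of the data buffer */
    uint64_t prp2;
    uint32_t cdw10;             /* starting LBA, low 32 bits */
    uint32_t cdw11;             /* starting LBA, high 32 bits */
    uint32_t cdw12;             /* number of logical blocks, zero-based */
    uint32_t cdw13, cdw14, cdw15;
};

__global__ void submit_read(nvme_sqe *sq, volatile uint32_t *sq_doorbell,
                            uint32_t tail, uint64_t data_prp,
                            uint64_t slba, uint32_t nlb, uint32_t nsid)
{
    if (threadIdx.x != 0 || blockIdx.x != 0)
        return;

    nvme_sqe cmd = {};
    cmd.opc   = 0x02;                   /* NVMe Read */
    cmd.cid   = (uint16_t)tail;
    cmd.nsid  = nsid;
    cmd.prp1  = data_prp;               /* data lands directly in accelerator memory */
    cmd.cdw10 = (uint32_t)slba;
    cmd.cdw11 = (uint32_t)(slba >> 32);
    cmd.cdw12 = nlb - 1;

    sq[tail] = cmd;                     /* place the SQE in the mapped SQ */
    __threadfence_system();             /* order the SQE write before the doorbell */
    *sq_doorbell = tail + 1;            /* ring the SQ tail doorbell */
}
```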