Introduction#

AI workloads have driven unprecedented demand for storage bandwidth. Training, inference, and other data-intensive applications require accelerators to process massive datasets, yet storage access remains constrained by three independent bottlenecks: software abstraction overhead in the I/O stack, unnecessary data copies through host memory, and the lack of mechanisms for accelerators to participate directly in storage I/O. These are distinct problems with distinct solutions, and existing systems address them in isolation, often through proprietary or OS-incompatible means.

Software Abstraction Overhead#

The kernel storage stack interposes multiple software layers between an application and the NVMe controller: system call entry, VFS dispatch, file system logic, block-layer scheduling, and driver processing. Each layer adds latency and CPU overhead. Successive generations of I/O interfaces have worked to reduce this cost. Early interfaces such as pread/pwrite and POSIX aio gave way to the Linux-specific libaio, which reduced system call overhead through kernel-managed asynchronous I/O. io_uring advanced this further with submission and completion rings shared between user space and the kernel, enabling batched, polled I/O with minimal system call transitions. Most recently, io_uring_cmd extends this model to pass NVMe commands through the kernel driver with reduced block-layer overhead. Each generation has lowered software overhead, but all remain within the kernel storage stack.
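
To make the ring-based model concrete, the sketch below submits a single read through liburing, the user-space companion library for io_uring. It is a minimal illustration only: the device path, block size, and queue depth are placeholders, and error handling is reduced to the essentials.

```c
/* Minimal liburing sketch: one asynchronous read submitted and reaped through
 * the shared submission/completion rings. Illustrative only. */
#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    void *buf;

    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);   /* placeholder device */
    if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
        return 1;
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;

    /* Fill one submission queue entry: read 4 KiB from offset 0. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);

    io_uring_submit(&ring);           /* one system call for the whole batch */
    io_uring_wait_cqe(&ring, &cqe);   /* reap the completion                 */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```

io_uring_cmd follows the same submission and completion flow but carries a raw NVMe command in the submission entry via the NVMe character device, trading file system and block-layer services for lower per-command overhead.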

SPDK [1] demonstrated that moving the NVMe driver entirely to user space, thereby bypassing the kernel, eliminating interrupts and context switches, and polling for completions, could achieve dramatically higher I/O performance than any kernel path available at the time. It took roughly five years before the Linux kernel introduced io_uring, which narrowed the performance gap substantially. SPDK remains faster, though no longer by an order of magnitude. The architectural trade-off persists: SPDK requires exclusive device ownership, removing the NVMe controller from kernel management and requiring OS-provided abstractions to be rebuilt in user space or forgone entirely.
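
The contrast with the kernel path can be seen in a minimal SPDK sketch: the process claims the controller outright, allocates its own I/O queue pair, and spins on the completion queue rather than sleeping on an interrupt. The sketch assumes a single locally attached PCIe controller with an active namespace 1 and omits most error handling.

```c
/* Minimal SPDK sketch: user-space NVMe driver, polled completions. */
#include <spdk/env.h>
#include <spdk/nvme.h>
#include <stdbool.h>
#include <stdio.h>

static struct spdk_nvme_ctrlr *g_ctrlr;
static struct spdk_nvme_ns *g_ns;
static volatile bool g_done;

static bool probe_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                     struct spdk_nvme_ctrlr_opts *opts) { return true; }

static void attach_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                      struct spdk_nvme_ctrlr *ctrlr,
                      const struct spdk_nvme_ctrlr_opts *opts)
{
    g_ctrlr = ctrlr;
    g_ns = spdk_nvme_ctrlr_get_ns(ctrlr, 1);    /* assume namespace 1 */
}

static void read_done(void *arg, const struct spdk_nvme_cpl *cpl) { g_done = true; }

int main(void)
{
    struct spdk_env_opts opts;
    spdk_env_opts_init(&opts);
    if (spdk_env_init(&opts) < 0)
        return 1;

    /* Claim any locally attached NVMe controller (removes it from the kernel). */
    if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0 || g_ns == NULL)
        return 1;

    struct spdk_nvme_qpair *qp = spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);
    void *buf = spdk_dma_zmalloc(4096, 4096, NULL);      /* DMA-able buffer */

    /* Submit one read of LBA 0, then poll: no system calls, no interrupts. */
    spdk_nvme_ns_cmd_read(g_ns, qp, buf, 0, 1, read_done, NULL, 0);
    while (!g_done)
        spdk_nvme_qpair_process_completions(qp, 0);

    printf("read complete\n");
    spdk_dma_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qp);
    spdk_nvme_detach(g_ctrlr);
    return 0;
}
```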

Unnecessary Data Copies#

Independent of software overhead, accelerator workloads suffer from redundant data movement. When a GPU requires data from storage, the conventional path routes it through host DRAM: the NVMe controller writes to host memory, and a separate copy transfers the data to GPU memory. Peer-to-peer (P2P) DMA eliminates this intermediate copy by allowing the NVMe controller to transfer data directly to GPU memory over the PCIe fabric.
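
To make the redundancy concrete, the following sketch shows the conventional two-hop path using the CUDA runtime API; the helper name and sizes are illustrative. P2P DMA collapses the two transfers into a single NVMe-to-GPU DMA.

```c
/* Conventional GPU read path: data crosses PCIe twice,
 * NVMe -> host DRAM, then host DRAM -> GPU memory. Illustrative sketch. */
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>

int read_to_gpu_bounced(const char *path, void *dev_buf, size_t len, off_t off)
{
    void *host_buf;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    cudaMallocHost(&host_buf, len);               /* pinned host bounce buffer */
    ssize_t n = pread(fd, host_buf, len, off);    /* hop 1: NVMe -> host DRAM  */
    if (n > 0)
        cudaMemcpy(dev_buf, host_buf, (size_t)n,  /* hop 2: host DRAM -> GPU   */
                   cudaMemcpyHostToDevice);

    cudaFreeHost(host_buf);
    close(fd);
    return n < 0 ? -1 : 0;
}
```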

NVIDIA GPUDirect Storage (GDS) [2] was the first widely deployed system to achieve this, integrating P2P into the kernel I/O path through vendor-specific kernel modifications. GDS enables CPU-initiated NVMe commands with data buffers residing in GPU memory, bypassing host DRAM on the data path. However, GDS is proprietary and tightly coupled to the NVIDIA driver stack. Open alternatives using io_uring with dma-buf for P2P buffer management are under development within the Linux kernel.
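
A CPU-initiated P2P read through the public cuFile API looks roughly like the sketch below: the file is registered with the cuFile driver and the read targets a GPU buffer directly. The helper name is illustrative and error checking is omitted.

```c
/* GDS-style read: the CPU issues the request, the NVMe controller DMAs the
 * data straight into GPU memory, bypassing host DRAM. Illustrative sketch. */
#define _GNU_SOURCE
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int read_to_gpu_p2p(const char *path, void *dev_buf, size_t len, off_t off)
{
    CUfileDescr_t descr;
    CUfileHandle_t fh;

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    cuFileDriverOpen();
    cuFileHandleRegister(&fh, &descr);

    /* Data path: storage -> GPU memory, no host bounce buffer. */
    ssize_t n = cuFileRead(fh, dev_buf, len, off, 0);

    cuFileHandleDeregister(fh);
    cuFileDriverClose();
    close(fd);
    return n < 0 ? -1 : 0;
}
```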

Device-Initiated I/O#

A third, independent direction is device-initiated I/O, where the accelerator itself constructs and submits NVMe commands. This removes the host CPU from the command path entirely, allowing the accelerator to drive I/O at its own pace without waiting for CPU-mediated scheduling. Early academic work in this direction includes libnvm [3], which demonstrated GPU-resident NVMe driver code capable of submitting commands directly from GPU threads. BaM [4] extended this with NVMe queue partitioning and a demand-paging model for fine-grained GPU-driven storage access.
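
The queue mechanics such a device-resident driver must implement are sketched below in plain C for readability: fill a 64-byte submission queue entry, advance the tail, and write the doorbell. In libnvm and BaM the queue memory and doorbell register are mapped into the GPU's address space so that GPU threads perform these steps directly; the field layout follows the NVMe specification, while the structure and function names here are illustrative.

```c
#include <stdint.h>

struct nvme_sqe {                 /* 64-byte NVMe submission queue entry */
    uint32_t cdw0;                /* opcode, flags, command identifier   */
    uint32_t nsid;                /* namespace ID                        */
    uint64_t rsvd;
    uint64_t mptr;                /* metadata pointer                    */
    uint64_t prp1, prp2;          /* data pointers (bus addresses)       */
    uint32_t cdw10, cdw11;        /* starting LBA (low, high)            */
    uint32_t cdw12;               /* number of logical blocks, 0-based   */
    uint32_t cdw13, cdw14, cdw15;
};

struct nvme_sq {
    volatile struct nvme_sqe *entries;   /* queue memory, DMA-able           */
    volatile uint32_t *doorbell;         /* controller's SQ tail doorbell    */
    uint32_t tail, size;
};

/* Submit one read. If buf_bus_addr points at GPU memory, device-initiated
 * submission and P2P data transfer combine: neither commands nor data touch
 * the host. A real driver also issues a memory barrier before the doorbell. */
static void nvme_submit_read(struct nvme_sq *sq, uint16_t cid, uint32_t nsid,
                             uint64_t buf_bus_addr, uint64_t slba, uint16_t nlb)
{
    volatile struct nvme_sqe *sqe = &sq->entries[sq->tail];
    sqe->cdw0  = 0x02u | ((uint32_t)cid << 16);   /* opcode 0x02 = NVM Read */
    sqe->nsid  = nsid;
    sqe->prp1  = buf_bus_addr;
    sqe->cdw10 = (uint32_t)slba;
    sqe->cdw11 = (uint32_t)(slba >> 32);
    sqe->cdw12 = (uint32_t)(nlb - 1);             /* 0-based block count    */

    sq->tail = (sq->tail + 1) % sq->size;
    *sq->doorbell = sq->tail;                     /* ring the submission doorbell */
}
```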

On the proprietary side, NVIDIA’s SCADA (SCalable Accelerated Data Access), part of the StorageNext initiative, pursues device-initiated storage access through a client-server architecture with a user space NVMe driver and a proprietary GPU-oriented I/O protocol. SCADA interposes a user-configurable software cache in GPU HBM between the application and storage; its performance gains derive primarily from cache hits rather than from more efficient I/O submission [5].

These systems achieve high performance but do so by abandoning file systems, POSIX semantics, or interoperability with OS-managed storage, or by relying on proprietary infrastructure.

AiSIO#

The approaches presented above share a common limitation: each targets one bottleneck without addressing the others, and each does so at a cost to OS interoperability or file system semantics. An open system that addresses all three bottlenecks while preserving OS-managed storage semantics does not yet exist.

Accelerator-integrated Storage I/O (AiSIO) designates a class of system software architectures that address all three bottlenecks through open, composable components while preserving interoperability with OS-managed storage. AiSIO systems reduce software overhead through user space NVMe drivers, eliminate unnecessary copies through P2P DMA with dma-buf, and enable device-initiated I/O through device-resident NVMe drivers. The CPU and operating system retain responsibility for global coordination, device management, metadata handling, and policy enforcement. Accelerators participate in data-path execution or initiate I/O under host coordination, using open interfaces rather than proprietary driver stacks.

Host Orchestrated Multipath I/O (HOMI) serves as a reference architecture and implementation within the AiSIO class. HOMI demonstrates how these principles can be realized in open and modifiable system software, enabling multiple I/O paths to coexist with shared access to storage resources while preserving operating-system semantics.

This paper introduces AiSIO as a conceptual framework, defines a taxonomy of I/O paths within that framework, presents the HOMI architecture, and describes a proof-of-concept implementation evaluated through a series of synthetic benchmarks.