Welcome to AISIO Documentation

Abstract
Introduction
  Software Abstraction Overhead
  Unnecessary Data Copies
  Device-Initiated I/O
  AiSIO
Background
  Terminology
  CPUs
    CPU Architecture and Components
    CPU Frequency and Scaling
    Simultaneous Multithreading (SMT)
  Memory
    Virtual Memory Abstraction
    Contiguous vs. Non-Contiguous Memory
    Memory Pinning
    Direct Memory Access (DMA)
    Non-Uniform Memory Access (NUMA)
    Memory-Mapped I/O (MMIO)
  PCIe
    Address Translation: ACS, IOMMU, IOVA, ATS, and ATC
    Routing: Root Complex Mediated
    Routing: Switch Mediated
    Routing: Device-routed (ID-routed)
    Multi-function devices
    Summary
  Massively Parallel Processing Units
    CUDA
  NVMe Controllers
    Controller Enablement and the Admin Queue
    I/O Queue Pairs
    Command Construction
    Payload Description
    Doorbells and Completion
  NVMe Drivers
    Driver Residency
    I/O Initiator Roles
    DMA Reachability Requirements
  Block Devices
  Files and File Systems
    POSIX Semantics
  Software Components
  Summary
Architecture
  I/O Path Taxonomy
  The Coexistence Problem
  AiSIO System Architectures
    CPU-Initiated I/O with P2P Memory and Kernel Infrastructure
    CPU-Initiated I/O with P2P Memory and User-Space Infrastructure
    Device-Initiated I/O with P2P Memory and User-Space Infrastructure
  Host Orchestrated Multipath I/O (HOMI)
    HOMI via Software-Mediated Multipath (ublk)
    HOMI via Hardware-Assisted Delegation (SR-IOV)
Related Work
Implementation
  Proof-of-Concept Implementation
  HOMI Reference Implementation (Work in Progress)
  uPCIe
  Device Memory Physical Address Resolution via udmabuf-import
  xnvmeperf
  Device-Initiated Benchmarking
Environments
  High-Performance Compute Server
  Professional GPU Workstation
  Desktop Workstation
  Legacy GPU Server
Experimental Framework
  System Setup
  Benchmarks
    Synthetic
    File-based
Experiments
  CPU-initiated I/O: Optimal Parameter Search
    Independent variables
    Metrics Collected
    Environment
    Execution of the Experiment
    Results
    Summary
  CPU-initiated I/O: Software Abstraction Overhead
    I/O Paths
    Comparisons and Their Interpretability
    Experimental Setup
    Results
    Summary
  CPU-Initiated P2P I/O: PCIe Bandwidth Saturation
    Independent Variables
    Metrics Collected
    Environment
    Execution of the Experiment
    Results
    Summary
  Device-initiated I/O: I/O Size Scaling
    Independent Variables
    Metrics Collected
    Environment
    Execution of the Experiment
    Results
    Summary
  Device-initiated I/O: Queue Depth Scaling
    Independent Variables
    Metrics Collected
    Environment
    Execution of the Experiment
    Results
    Summary
Conclusion
Future Work
  Completing the HOMI Reference Implementation
  Kernel Integration and Upstream Components
  Broader Accelerator Support
  Multi-Accelerator Topologies
  Remote Storage and RDMA
  Evaluating Device-Initiated Paths
References