# Environments
This section describes the system configurations used for experimenting with AiSIO system software architectures, including the initial AiSIO proof-of-concept and the development of a reference implementation. These environments support evaluation, document constraints encountered during development, and serve as reproducible reference builds for readers wishing to explore similar architectures.
The same environments were also used to reproduce, where possible, benchmarks from related work, specifically GDS [2] and BaM [4], ensuring a controlled comparison under identical system conditions.
The configurations reflect practical system builds intended to support architectural exploration, implementation, and comparative evaluation, rather than prescribing production deployment guidelines.
All environments are configured with the platform prerequisites for P2P DMA: Resizable BAR enabled, Above 4G Decoding enabled, IOMMU disabled, and ACS disabled. Note that the availability and naming of these options vary across BIOS vendors; for example, Resizable BAR may appear as “ReBAR”, “Resizable BAR Support”, or “Re-Size BAR Support”.
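The IOMMU prerequisite can be sanity-checked from userspace by inspecting the kernel command line. The sketch below is a minimal helper, assuming the common Linux kernel parameters (`amd_iommu=off`, `intel_iommu=off`, `iommu=off`, `iommu=pt`); it is a heuristic, not a definitive check, since the IOMMU can also be configured purely in firmware.

```python
def iommu_inactive(cmdline: str) -> bool:
    """True if the kernel command line disables DMA remapping
    (or puts the IOMMU into identity-mapped passthrough mode,
    which also leaves P2P DMA addresses untranslated)."""
    disable_flags = {"amd_iommu=off", "intel_iommu=off",
                     "iommu=off", "iommu=pt"}
    return any(tok in disable_flags for tok in cmdline.split())

# Example with a hypothetical command line:
print(iommu_inactive("root=/dev/nvme0n1p2 amd_iommu=off quiet"))  # True
print(iommu_inactive("root=/dev/nvme0n1p2 quiet splash"))         # False
```

On a live system one would feed it the contents of `/proc/cmdline`.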
## High-Performance Compute Server
An enterprise-grade dual-socket server with high-capacity NVMe storage, used for the primary CPU-initiated I/O and benchmark tool comparison experiments. The platform, GPU, and NVMe storage all operate at PCIe Gen5, providing full-bandwidth connectivity throughout the system. With 32 × 64 GiB of DDR5, the system provides 2 TiB of host memory.
| Hardware | Details |
|---|---|
| Motherboard | Dell PowerEdge R760 |
| CPU | 2x Intel® Xeon® Gold 6442Y |
| Memory | 32x 64GiB Samsung DDR5 4800MHz |
| GPU | 2x NVIDIA H100 80GB |
| Storage | 16x Samsung SSD PM1753 32TB |
## Professional GPU Workstation
A workstation built around a professional-grade GPU with a high core-count CPU, used for development and evaluation of the reference implementation. The platform, GPU, and NVMe storage all operate at PCIe Gen4.
| Hardware | Details |
|---|---|
| Motherboard | Supermicro H12SSL-I |
| CPU | 1x AMD EPYC 7532 32-Core (Rome) |
| Memory | 8x 32GiB Samsung DDR4 2667MHz |
| GPU | 1x NVIDIA RTX A5000 24GB |
| Storage | 4x Samsung 990 PRO 2TB |
The system supports P2P between all seven PCIe slots, which can be populated with varying combinations of GPUs, NVMe devices, and RDMA NICs, making it an ideal testbed for accelerator-integrated storage I/O topologies.
## Desktop Workstation
A consumer desktop used for development and light experimentation. Both the NVIDIA RTX A2000 6GB and the NVIDIA RTX PRO 2000 Blackwell 16GB have been used and validated in this environment. The AMD B550 platform limits PCIe bandwidth to Gen4 on CPU-direct slots, constraining the RTX PRO 2000 Blackwell below its Gen5 capability. A single NVMe device limits aggregate storage bandwidth and precludes multi-device parallelism.
| Hardware | Details |
|---|---|
| Motherboard | MSI MAG B550M MORTAR WIFI |
| CPU | 1x AMD Ryzen 7 5800X 8-Core |
| Memory | 2x 16GiB DDR4 2133MHz |
| GPU | 1x NVIDIA RTX PRO 2000 Blackwell 16GB / RTX A2000 6GB |
| Storage | 1x Samsung 980 PRO 1TB |
Unlike on the Professional GPU Workstation, P2P is possible only between devices installed in the PCIe expansion slot labeled PCI_E1 and the M.2 slot labeled M2_1. These are connected to the CPU root complex, whereas the other PCIe and M.2 slots are connected to the platform controller hub (PCH).
## Legacy GPU Server
A server with legacy Volta-generation GPUs, used for pre-work [18] and the initial AiSIO proof-of-concept [21, 22]. Although the system is PCIe Gen4, the V100 is a PCIe Gen3 device, limiting peak GPU bandwidth to approximately 16 GB/s per slot and constraining peer-to-peer DMA throughput compared to Gen4 systems.
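The ~16 GB/s figure follows directly from the PCIe per-lane transfer rates and line encodings. A short calculation, using the rates published in the PCIe specifications (Gen1/2 use 8b/10b encoding, Gen3 and later use 128b/130b):

```python
def pcie_bandwidth_gbs(gen: int, lanes: int) -> float:
    """Theoretical unidirectional PCIe bandwidth in GB/s,
    from per-lane transfer rate (GT/s) and line-encoding overhead."""
    rate_gts = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0}[gen]
    encoding = 8 / 10 if gen <= 2 else 128 / 130
    return rate_gts * encoding * lanes / 8  # bits -> bytes

print(round(pcie_bandwidth_gbs(3, 16), 2))  # 15.75 GB/s: the V100's x16 slot limit
print(round(pcie_bandwidth_gbs(4, 16), 2))  # 31.51 GB/s: what the Gen4 platform could carry
```

This excludes packet-level protocol overhead, so achievable throughput is somewhat lower in practice.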
| Hardware | Details |
|---|---|
| Motherboard | Gigabyte G292-Z20 |
| CPU | 1x AMD EPYC 7402P 24-Core |
| Memory | 8x 32GiB SK Hynix DDR4 2400MHz |
| GPU | 2x NVIDIA V100 16GB |
| Storage | 4x Samsung 980 PRO 1TB |
The system has four risers, each connected to the motherboard via a PCIe Gen4 x16 slot and carrying a Microsemi PCIe Gen4 switch with two x16 downstream ports. Notably, data can move P2P via the switch without traversing the CPU root complex.
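One way to confirm where P2P traffic between two endpoints is routed is to compare their sysfs device paths, since each path component below `/sys/devices` is a bridge or switch port along the route: if the deepest shared component is a switch port rather than the root complex, the transfer stays below the CPU. A sketch, using hypothetical PCI addresses:

```python
def deepest_common_bridge(path_a: str, path_b: str) -> str:
    """Deepest shared component of two sysfs PCI device paths.
    P2P traffic between the two endpoints is routed at this point."""
    common = []
    for a, b in zip(path_a.strip("/").split("/"), path_b.strip("/").split("/")):
        if a != b:
            break
        common.append(a)
    return common[-1] if common else ""

# Hypothetical topology: GPU and NVMe behind downstream ports of
# the same switch, whose upstream port is 0000:41:00.0
gpu  = "/sys/devices/pci0000:40/0000:40:03.1/0000:41:00.0/0000:42:00.0/0000:43:00.0"
nvme = "/sys/devices/pci0000:40/0000:40:03.1/0000:41:00.0/0000:42:04.0/0000:44:00.0"
print(deepest_common_bridge(gpu, nvme))  # 0000:41:00.0 -> shared switch, no root-complex hop
```

On a live system the paths can be obtained by resolving `/sys/bus/pci/devices/<addr>` symlinks.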