Since Magnum IO GPUDirect Storage was originally announced at SC19, the storage world has been on the edge of its seat waiting for NVIDIA to unveil its CPU-bypass capability. GDS is a transformative approach to handling GPU I/O – allowing storage systems to write data directly into GPU memory using RDMA.
GDS is not just an I/O accelerator. By bypassing the CPU, customers also benefit from better CPU utilization when CPUs are not used for shipping data. A picture says 1,000 words, so here’s how it works:
NVIDIA finally took the covers off of GPUDirect Storage performance at GTC20. Here, VAST was used as the reference for what good looks like (see slide 15 from the NVIDIA talk). While GDS is no longer a new topic, we still hear many lingering questions from customers as they consider their options. In support of NVIDIA’s launch of GDS 1.0, we wanted to take a moment to dispel some misconceptions and discuss how GDS can be a universal solution for making I/O better with Universal Storage.
Myth #1: GDS is Only for AI Acceleration
False. GDS acceleration applies to any application that needs enhanced bandwidth or CPU efficiency. AI is a big driver of this effort: GDS accelerates CUDA-enabled applications and has specific optimizations for frameworks such as PyTorch. That said, we’ve been pleasantly surprised to see adoption of this technology well beyond classic AI frameworks:
- In media and entertainment – a leading telco has selected VAST Universal Storage technology to provide GDS-accelerated performance for volumetric video capture. When GPU machines need to simultaneously ingest multiple high definition video streams in order to stitch together a 3D video, fast bandwidth is critical to ensuring that no frames are dropped and that each step of the pipeline is GPU efficient.
- A leading quantitative trading firm is using Universal Storage to accelerate their big data pipeline, running Spark on the RAPIDS framework to outperform their legacy Spark implementation. Here, an in-memory computing framework benefits greatly from the high throughput that comes from VAST’s multi-path NFS with RDMA.
- Beyond these use cases, NVIDIA has prebuilt frameworks to support a variety of applications including high performance visualization (IndeX) and healthcare (Clara).
Myth #2: The Only Way to Use GDS is to Consume a Pre-Built NVIDIA Library
False. We now have several customers across a variety of industries, all of which are writing their own applications, who program natively against the optimized cuFile-based method of writing data directly into GPU memory. POSIX pread and pwrite require buffers in CPU system memory and an extra copy, but cuFile read and write only require file handle registration. More information on how to leverage the cuFile API and GPUDirect Storage for your applications can be found here.
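To make the "extra copy" concrete, here is a minimal sketch of the conventional POSIX path that cuFile is designed to avoid. It is illustrative only: the helper name `posix_read_to_host` and the sample file are hypothetical, and the `bytearray` is merely a stand-in for GPU memory so the sketch runs without a GPU. In a real GDS application, the read would land in device memory directly through the cuFile API rather than passing through a host buffer.

```python
import os
import tempfile


def posix_read_to_host(path, size, offset=0):
    """Read `size` bytes at `offset` into a host (CPU) buffer using pread."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Copy 1: storage -> CPU system memory.
        return os.pread(fd, size, offset)
    finally:
        os.close(fd)


# Hypothetical sample file standing in for a dataset on shared storage.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"frame-data" * 4)
    path = f.name

host_buf = posix_read_to_host(path, size=10, offset=10)

# Copy 2: CPU memory -> "GPU" memory. This bytearray is only a stand-in;
# a real application would issue a host-to-device copy at this point.
device_buf = bytearray(host_buf)

os.unlink(path)
```

The second copy, and the CPU cycles spent on it, is what the direct path removes.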
Myth #3: All GPUDirect Storage Client Access (using NFS) is Born Equal
False. NFS is experiencing a performance renaissance and is helping companies realize that the complexities of parallel file systems aren’t worth the hassle when NFS can be equally capable. VAST is leading this effort globally by popularizing support for NFS over RDMA and by collaborating with NVIDIA on enabling GDS for NFS. That said, while other vendors have followed VAST’s lead, not all NFS client access is created equal.
In addition to supporting RDMA and GDS, VAST has also extended support for multi-pathing in NFSv3. With this, you can federate all I/O from a single NFS mountpoint across multiple client-side network ports, bringing new availability and performance improvements for machines such as NVIDIA’s DGX A100 servers (configurable with up to ten 200Gb-capable network ports!). To understand the accretive benefits of GDS on NFS multi-path, we recently measured a baseline test using a single GPU with a single NIC. Using GPUDirect Storage, a single A100 GPU approached full saturation of the network card by squeezing out an additional 20% of throughput, while simultaneously reducing the CPU utilization of the client by nearly 9X.
What’s more interesting, perhaps, is seeing how GDS with NFS multi-path scales in performance as you add more network bandwidth to high-end compute servers. To truly test the perceived limits of NFS performance, we conducted an experiment using both NFS multi-path and GDS on a DGX A100 server configured with eight 200Gb InfiniBand NICs connected to a VAST storage cluster. With access to all that additional network bandwidth, NVIDIA measured over 162GiB/s of read performance via a single NFS mountpoint. When you can remove all of the I/O bottlenecks for single-machine I/O, this is the result:
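As a back-of-the-envelope check on that figure, the sketch below compares the measured 162GiB/s with the nominal aggregate line rate of eight 200Gb ports. It assumes 1 Gb = 10^9 bits and ignores protocol and encoding overhead, so treat the ratio as a rough upper-bound comparison rather than a precise efficiency number. The function name is hypothetical.

```python
GIB = 2**30  # bytes per GiB


def nominal_gib_per_s(ports: int, gbit_per_port: float) -> float:
    """Aggregate nominal line rate in GiB/s, assuming 1 Gb = 1e9 bits."""
    bytes_per_s = ports * gbit_per_port * 1e9 / 8
    return bytes_per_s / GIB


line_rate = nominal_gib_per_s(ports=8, gbit_per_port=200)
measured = 162.0  # GiB/s, the read figure quoted above
fraction = measured / line_rate

print(f"nominal: {line_rate:.1f} GiB/s, measured/nominal: {fraction:.0%}")
```

Under these assumptions the eight ports offer roughly 186GiB/s in aggregate, so the measured read rate corresponds to about 87% of nominal line rate through a single mountpoint.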
Myth #4: GPUDirect Storage Using RDMA Requires InfiniBand Networks
False. While NFS over RDMA is certainly a stable and high performance option to run over MOFED and OFED InfiniBand networks, it also works equally well over RDMA over Converged Ethernet (RoCE). VAST customers run NFS over RDMA (and GDS) over both InfiniBand and several flavors of data center Ethernet networking.
Myth #5: All Scale-Out NAS Servers for GPUDirect Are Born Equal
False. VAST is being installed into GPU environments where customers have 100s or 1000s of GPUs. When you get to this level of investment, scale, and how storage architectures scale, really matter.
Conventional ‘shared-nothing’ approaches to storage scaling suffer diminishing returns at scale, as conventional storage technologies wrestle with coordinating updates and I/Os across a growing number of nodes. As nodes talk to each other, the communication grows geometrically with cluster scale, and read/write requests must be serviced indeterminately by partner nodes in the cluster that might themselves be overloaded, dealing with errors or performing rebuilds. These interdependencies are the enemy of storage scaling.
VAST’s Universal Storage architecture has pioneered a new form of Disaggregated and Shared Everything scaling. By eliminating east-west traffic in a storage cluster, each of the VAST file/object servers has a direct path down to a shared pool of SSDs and doesn’t need to wait for any other machine in order to complete an I/O. This new architecture enables embarrassingly-parallel scale and is the basis for how VAST is building clusters that easily scale to deliver TB/s of performance and millions of IOPS.
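The scaling contrast can be illustrated with a deliberately simplified model. The sketch below counts potential server-to-server (east-west) relationships, assuming a worst case in which every node in a shared-nothing cluster may need to coordinate with every other node. The function name and the all-to-all assumption are ours for illustration; this is not a measurement of any product.

```python
def east_west_links(nodes: int, shared_nothing: bool = True) -> int:
    """Count potential node-to-node coordination pairs in a cluster.

    Assumes all-to-all coordination for shared-nothing designs. For a
    design with no east-west traffic, the count is zero by definition.
    """
    if not shared_nothing:
        return 0
    return nodes * (nodes - 1) // 2


for n in (4, 16, 64):
    print(n, east_west_links(n), east_west_links(n, shared_nothing=False))
```

In this model, going from 16 to 64 nodes multiplies the pair count from 120 to 2016, while a design whose servers never talk to each other stays at zero regardless of cluster size.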
Hopefully that helps answer some of the questions on this emerging field of accelerated I/O for next-generation GPU systems. To learn more about what VAST’s support for GPUDirect Storage can do for your applications, give us a shout.
 Performance numbers shown here with NVIDIA GPUDirect Storage on NVIDIA DGX A100 slots 0-3 and 6-9 are not the officially supported network configuration and are for experimental use only. Sharing the same network adapters for both compute and storage may impact the performance of any benchmarks previously published by NVIDIA on DGX A100 systems.