Feb 18, 2025

The Rise of S3/RDMA: Modernizing Data Access for AI

Authored by

Sagi Grimberg, VP of Architecture

S3 has long been established as the de-facto standard interface for object storage. Its simplicity, scalability, and reliability have made it a cornerstone for data management and big data analytics.

More recently, S3 has been making strides in AI training and inference. However, as AI workloads grow in complexity and scale, the need for faster, more efficient data access becomes ever more pressing.

Enter S3 over RDMA, a protocol designed to meet these demands by combining the power of Remote Direct Memory Access (RDMA) with the ubiquity of the S3 object storage interface.

The Power of RDMA

AI training involves repeatedly accessing huge datasets, and S3’s adoption has been driven by its ability to serve as the primary interface for this data, ensuring consistency, durability, and accessibility.

RDMA, meanwhile, has historically been the predominant network technology in HPC environments and has found new life in AI thanks to its ability to deliver high-speed networking with minimal CPU involvement. RDMA enables data to move directly from the memory of one computer into that of another without involving either host CPU, which results in:

  • User-space networking: Network operations and data transfers are handled solely in user space, reducing overhead associated with system calls.

  • Zero-copy: Data is transferred without unnecessary copies between intermediate network buffers and user buffers, leading to lower latency and higher bandwidth, often reaching tens of GB/s per node.

  • Low CPU utilization: By offloading the network processing associated with reliable delivery and ordering, the CPU is freed for application-specific computational tasks, which is crucial in AI where every bit of processing power counts.
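The zero-copy idea can be illustrated in-process with Python's `memoryview` — this is only an analogy, not RDMA itself: a `memoryview` exposes an existing buffer without duplicating it, much as an RDMA NIC writes payload bytes straight into a registered application buffer instead of copying through intermediate buffers.

```python
# In-process analogy for RDMA's zero-copy transfer (NOT actual RDMA):
# a memoryview references a buffer without copying it.

payload = bytearray(b"checkpoint-shard-0001")

# Copying path: materializing bytes creates a new, independent buffer.
copied = bytes(payload[0:10])

# Zero-copy path: a memoryview shares the same underlying memory.
view = memoryview(payload)[0:10]

# Mutating the source is visible through the view (shared memory)...
payload[0:1] = b"C"
assert view.tobytes() == b"Checkpoint"

# ...but not through the copy, which snapshotted the old bytes.
assert copied == b"checkpoin" + b"t"
```

The copying path is what conventional TCP sockets impose on every transfer; the shared-memory path is the behavior RDMA extends across the network.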

S3/RDMA: A Synergy for AI Workloads

Combining S3 with RDMA creates S3/RDMA, a protocol that promises to dramatically improve how AI builders and consumers interact with data:

  • Speed and Efficiency: S3/RDMA can deliver the necessary speed for AI workloads, particularly in scenarios involving frequent checkpoints during long-running training jobs. The low latency and high bandwidth are key to maintaining the pace of modern AI training.

  • Enhanced Data Streaming: RDMA’s capability to stream data directly into memory without CPU intervention means that S3 data can be accessed and utilized with minimal delay. This is particularly beneficial for streaming large datasets into AI models, allowing for real-time or near real-time processing in training or inference serving use cases.

  • Portability and Cloud-Native: S3 is inherently cloud-native and portable, enabling seamless data movement across data centers and clouds. This portability becomes even more powerful with S3 over RDMA, which accelerates data access once it reaches its destination in a core datacenter, making it faster and more valuable for AI workloads. Innovations like CoreWeave’s AI S3 offering with VAST highlight this potential, demonstrating that S3 can deliver both cloud-native flexibility and high-performance speed.
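The checkpointing pattern above can be sketched in a few lines. This is a hypothetical stand-in, not VAST's or S3's actual implementation: a plain dict plays the role of the object store, and `PART_SIZE` is shrunk to a few bytes (real multipart uploads use multi-MB parts issued in parallel).

```python
import hashlib

PART_SIZE = 8  # bytes for illustration; real S3 parts are >= 5 MiB

def upload_checkpoint(store: dict, key: str, data: bytes) -> str:
    """Split a checkpoint into fixed-size parts and store each one,
    mimicking a multipart upload; returns an ETag-style digest."""
    for i in range(0, len(data), PART_SIZE):
        store[f"{key}.part{i // PART_SIZE}"] = data[i:i + PART_SIZE]
    return hashlib.md5(data).hexdigest()

def download_checkpoint(store: dict, key: str) -> bytes:
    """Reassemble the parts in numeric order, like ranged GETs."""
    parts = sorted((k for k in store if k.startswith(f"{key}.part")),
                   key=lambda k: int(k.rsplit("part", 1)[1]))
    return b"".join(store[k] for k in parts)

store = {}
upload_checkpoint(store, "ckpt/step-1000", b"model-weights-epoch-3")
assert download_checkpoint(store, "ckpt/step-1000") == b"model-weights-epoch-3"
```

Because each part is an independent object operation, parts can be pushed concurrently over many connections — the same parallelism that S3/RDMA accelerates at the transport layer.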

VAST Recognition of S3/RDMA’s Potential

VAST has recognized the transformative potential of S3/RDMA and is actively integrating support for the protocol into our offerings. By adding S3/RDMA, we will provide customers with a native multi-protocol system that delivers extreme performance and scale with lower overall CPU utilization.

The VAST Data Platform supports both file (NFS, SMB) and object (S3) access over Ethernet or InfiniBand, enabling seamless data mobility for demanding AI training workloads. With NVIDIA GPUDirect Storage (GDS) compatibility for both NFS and S3, users will be able to efficiently manage diverse datasets across multiple data sources and even different data centers. As NVIDIA advances RDMA for AI, integrating GDS with S3/RDMA can ensure direct data transfers into GPU memory, bypassing CPU bottlenecks. This is particularly advantageous for GenAI workloads, where high data throughput is critical to performance.

Challenging Traditional HPC Paradigms

While traditional HPC holds that parallel file systems are essential for AI, the market is witnessing a shift. The deployment of S3 checkpoints and data loaders in training environments demonstrates a growing preference for S3’s access model. These tools not only move data asynchronously but also suit the dynamic, cloud-centric nature of modern AI development.
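The data-loader pattern mentioned above can be sketched with a prefetching generator. This is a simplified, hypothetical loader: `fetch` stands in for an S3 GET, and a background thread keeps a bounded queue full so network I/O overlaps with consumption, the way real S3 data loaders overlap object fetches with GPU compute.

```python
import queue
import threading

def fetch(key: str) -> bytes:
    """Hypothetical stand-in for an S3 GET of one training record."""
    return f"record-{key}".encode()

def prefetching_loader(keys, depth: int = 4):
    """Yield objects while a background thread fetches ahead,
    overlapping (simulated) network I/O with the consumer's work."""
    q: queue.Queue = queue.Queue(maxsize=depth)

    def producer():
        for k in keys:
            q.put(fetch(k))     # blocks when `depth` items are queued
        q.put(None)             # sentinel: no more data

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not None:
        yield item

batches = list(prefetching_loader([str(i) for i in range(8)]))
assert batches[0] == b"record-0" and len(batches) == 8
```

A production loader would run many fetch threads (or async ranged GETs) against the object store; the bounded queue is what keeps prefetching asynchronous without unbounded memory growth.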

S3/RDMA: The Future of Scalable AI

S3/RDMA is not just an incremental improvement; it’s a paradigm shift in how we think about data access in the context of AI and HPC.

By marrying the ubiquity and flexibility of S3 with the performance benefits of RDMA, we’re looking at a future where AI can scale faster, more efficiently, and with greater interoperability across different environments. As this protocol matures, it promises to underpin the next generation of AI infrastructure, making what was once considered high-performance computing accessible and practical for everyone from startups to large-scale enterprises.

Weigh in on S3/RDMA on Cosmos, the AI community where experts and enthusiasts come together to shape the future of AI.
