Feb 25, 2025

VAST Powers Blazing-Fast, S3-Native Model Streaming and Data Processing with NVIDIA Run:ai

NVIDIA Run:ai and VAST: Redefining Model Streaming, Data Availability, and Parallel Processing

Posted by

Dave Graham, Technical Marketing Manager

Artificial intelligence has evolved into a real-time decision-making engine, powering everything from business analytics to chatbots and reasoning, problem-solving agents. However, current AI infrastructure still suffers from an ingrained latency problem: a fundamental expectation that AI workloads will involve delays in data movement, model loading, and computation.

We've been conditioned to believe that model loading and availability come with significant wait times. From transferring massive datasets to training foundation models, delays feel inevitable in training. Likewise, inference picks up delays from indexing and retrieving content for RAG and from loading different models and user contexts. We notice these differences, slight as they may be, and they reinforce both the importance of immediacy and how much traditional AI infrastructure struggles to enable it.

Recently, Glenn Lockwood wrote that AI training infrastructure should have a scalable object store and node-local NVMe storage, a view grounded in his experience with some of the largest supercomputing clusters in the world. But what if AI models were ready when you needed them, without bottlenecks caused by storage access, data loading, or computational inefficiencies?

What if you could combine the node-local NVMe performance with a scalable S3 object store and not lose any performance?

These are the questions that VAST Data is answering with NVIDIA Run:ai, and a key breakthrough is the Run:ai Model Streamer, an open-source project designed to accelerate model loading and reduce cold-start times for large language models (LLMs). By optimizing tensor serialization and streaming directly from object storage, Model Streamer ensures that AI models are ready to execute almost instantly. In collaboration with VAST Data’s high-performance S3 interface, this solution achieves near-local NVMe speeds at scale, removing the storage bottlenecks imposed by legacy filesystems and model loaders and improving overall AI performance.
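To make the streaming pattern concrete, here is a minimal sketch of how a safetensors shard might be read tensor by tensor with the open-source Model Streamer's Python interface. The bucket and file path are placeholders, package extras and S3 path handling can differ between releases, and the class and method names reflect the project's published examples, so treat this as an illustration of the approach rather than a verified configuration.

```python
# Minimal sketch: stream tensors with the open-source Run:ai Model Streamer
# (pip install runai-model-streamer; an extra package may be required for
# S3-compatible object stores). Paths and bucket names are placeholders.
from runai_model_streamer import SafetensorsStreamer

# A safetensors shard on an S3-compatible endpoint (for example, a VAST S3
# bucket); the usual AWS credential and endpoint settings apply.
file_path = "s3://example-bucket/meta-llama-3-8b/model-00001-of-00004.safetensors"

with SafetensorsStreamer() as streamer:
    streamer.stream_file(file_path)              # start concurrent reads in the background
    for name, tensor in streamer.get_tensors():  # tensors are yielded as they arrive
        # Move each tensor to the target device as soon as it is ready,
        # rather than waiting for the entire file to land on local disk.
        gpu_tensor = tensor.to("cuda")
```

Because tensors are handed to the caller while the remaining reads are still in flight, the GPU can begin receiving weights long before the last byte leaves the object store.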

VAST Data with NVIDIA Run:ai: Data Availability Without the Latency Drag

At the core of AI performance is data accessibility. Large-scale AI workloads require high-throughput data pipelines, but traditional storage architectures create major friction points:

  • Data gravity: AI models rely on vast datasets, but moving data across storage tiers (local, network, or cloud) introduces delays

  • I/O bottlenecks: Traditional storage solutions often fail to meet the throughput demands of modern AI applications

  • Fragmentation: Disparate storage systems force data duplication and migration, increasing both latency and cost

VAST Data with NVIDIA Run:ai: Global S3 Storage at Local NVMe Speeds


Figure 1. Time to Load Models: Seconds by Concurrent Threads

The NVIDIA team utilized the Meta-Llama-3-8b model and Model Streamer version 0.6 to analyze the impact of optimized model loading on the underlying data infrastructure. The project revealed that when integrated with Run:ai Model Streamer:

  1. The VAST Data Platform can bridge the gap between local, direct-attached NVMe devices—which face challenges with data gravity and limited expansion—and an exabyte-scale data platform. Previously, AI teams depended on capacity-constrained local NVMe disks or direct-connect arrays for performant storage. Now, they can scale and share data seamlessly using our extremely fast, low-latency S3 implementation for objects.

  2. In some cases, the loading time difference between local NVMe storage and the VAST Data Platform was less than 0.6%, challenging the notion that local disks are the most performant option. By integrating directly with the VAST Data Platform’s object storage, the solution delivers storage access speeds comparable to local NVMe SSDs; a sketch of how such a load-time comparison might be scripted follows this list.
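For readers who want a sense of how a comparison like Figure 1 could be scripted, below is a hedged sketch that uses vLLM's Run:ai Model Streamer integration to load a model directly from an S3-compatible endpoint and report the elapsed time. The model path, endpoint URL, and concurrency values are illustrative assumptions, not the configuration NVIDIA tested, and the load_format and model_loader_extra_config options follow vLLM's documented integration, which may change between versions.

```python
# Hypothetical load-time measurement in the spirit of Figure 1. All paths,
# endpoints, and values are placeholders, not the benchmark configuration
# described in this post.
import os
import sys
import time

from vllm import LLM

# Streamer concurrency taken from the command line so that each measurement
# can run in a fresh process (for example, looped from a shell script).
concurrency = int(sys.argv[1]) if len(sys.argv) > 1 else 8

# Placeholder S3-compatible endpoint (for example, a VAST Data S3 VIP);
# credentials come from the standard AWS environment variables or config.
os.environ.setdefault("AWS_ENDPOINT_URL", "https://s3.vast.example.com")

start = time.perf_counter()
llm = LLM(
    model="s3://example-bucket/meta-llama-3-8b/",           # placeholder object-store path
    load_format="runai_streamer",                           # stream weights instead of copying them
    model_loader_extra_config={"concurrency": concurrency},
)
print(f"concurrency={concurrency}: model loaded in {time.perf_counter() - start:.1f}s")
```

Running the script once per concurrency level and plotting elapsed time against thread count would reproduce the shape of Figure 1 for whichever storage backend sits behind the endpoint.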

By collapsing the traditional storage-compute gap, VAST Data with NVIDIA Run:ai ensures that AI models no longer wait for data. Instead, they are immediately available on the VAST Data Platform at the speed of execution.

Bringing It All Together: AI at the Speed of Thought

VAST Data with NVIDIA Run:ai doesn’t just optimize AI model deployment—it redefines how enterprises approach AI latency, storage, and computation. By challenging the assumption that AI requires wait times, the solution enables:

  • Accelerated AI model availability through streaming, eliminating warm-up times

  • Real-time data access at local SSD speeds, removing storage bottlenecks

  • Parallelized AI execution, ensuring no computing cycles are wasted

Final Thoughts

For years, delays caused by data movement and access have been treated as a normal part of AI, but it doesn't have to be this way. The future of AI lies in real-time processing, instant model availability, and seamless parallel execution.

The takeaway for CTOs and AI buyers is clear: If your AI infrastructure is still causing delays, it’s time to rethink your strategy. CSPs and MSPs now have access to a powerful global namespace object store that delivers unprecedented flexibility and performance as customers incorporate AI workflows into their core operations.

The future of AI isn’t about waiting; it’s about action. And with VAST Data with NVIDIA Run:ai, that future is already here.

Want to learn more? Join us at NVIDIA GTC where Glenn Lockwood and other industry thought leaders will unpack this topic and more in greater detail. Add it to your GTC agenda here.
