Artificial intelligence has evolved into a real-time decision-making engine, powering everything from business analytics to real-time chatbots and reasoning, problem-solving agents. Yet current AI infrastructure still suffers from an ingrained latency problem: a baked-in expectation that AI workloads will involve delays in data movement, model loading, and computation.
We've been conditioned to believe that model loading and availability come with significant wait times. From transferring massive datasets to training foundation models, delays feel inevitable during training. Likewise, indexing and retrieving content for RAG, and loading different models and user contexts, introduce delays at inference time. We notice these differences, slight as they may be, and they reinforce both how much immediacy matters and how hard traditional AI infrastructure struggles to deliver it.
Recently, Glenn Lockwood wrote that AI training infrastructure should pair a scalable object store with node-local NVMe storage, a view grounded in his experience with some of the largest supercomputing clusters in the world. But what if AI models were ready the moment you needed them, without bottlenecks caused by storage access, data loading, or computational inefficiencies?
What if you could combine node-local NVMe performance with a scalable S3 object store without losing any of that performance?
These are the questions that VAST Data is answering with NVIDIA Run:ai, and a key breakthrough is the Run:ai Model Streamer, an open-source project designed to accelerate model loading and reduce cold-start times for large language models (LLMs). By optimizing tensor serialization and streaming weights directly from object storage, the Model Streamer ensures that AI models are ready to execute almost instantly. Combined with VAST Data’s high-performance S3 interface, the solution achieves near-local NVMe speeds at scale, removing the traditional bottlenecks imposed by legacy filesystems and model loaders and improving end-to-end AI performance.
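As a concrete illustration, here is a minimal sketch of how the open-source Model Streamer is typically used from Python, based on the project's published API (`SafetensorsStreamer`, `stream_file`, `get_tensors`). The bucket and file paths are placeholders, and details may differ across releases, so treat this as a sketch to check against the repository rather than a definitive recipe:

```python
# pip install runai-model-streamer  (the [s3] extra adds object-store support)
from runai_model_streamer import SafetensorsStreamer

# Placeholder path: could be a local .safetensors file or an S3 URI
# served by an S3-compatible store such as the VAST Data Platform.
file_path = "s3://models/meta-llama-3-8b/model-00001-of-00004.safetensors"

tensors = {}
with SafetensorsStreamer() as streamer:
    # The streamer reads the file with a pool of concurrent threads and
    # hands tensors back as they arrive, rather than after a full download.
    streamer.stream_file(file_path)
    for name, tensor in streamer.get_tensors():
        tensors[name] = tensor  # ready to copy into the model on GPU
```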
VAST Data with NVIDIA Run:ai: Data Availability Without the Latency Drag
At the core of AI performance is data accessibility. Large-scale AI workloads require high-throughput data pipelines, but traditional storage architectures create major friction points:
Data gravity: AI models rely on vast datasets, but moving data across storage tiers (local, network, or cloud) introduces delays
I/O bottlenecks: Traditional storage solutions often fail to meet the throughput demands of modern AI applications
Fragmentation: Disparate storage systems force data duplication and migration, increasing both latency and cost
VAST Data with NVIDIA Run:ai: Global S3 Storage at Local NVMe Speeds

Figure 1: Time to load models (seconds) by number of concurrent threads
The NVIDIA team used the Meta-Llama-3-8B model and Model Streamer version 0.6 to analyze the impact of optimized model loading on the underlying data infrastructure. The project revealed that, when integrated with the Run:ai Model Streamer, the VAST Data Platform can bridge the gap between local, direct-attached NVMe devices (which face challenges with data gravity and limited expansion) and an exabyte-scale data platform. Previously, AI teams depended on capacity-constrained local NVMe disks or direct-connect arrays for performant storage. Now, they can scale and share data seamlessly using our extremely fast, low-latency S3 object interface.
In some cases, the measured loading-time difference between local NVMe storage and the VAST Data Platform was less than 0.6%, challenging the notion that local disks are the most performant option. By integrating directly with the VAST Data Platform’s object storage, the solution delivers storage access speeds comparable to local NVMe SSDs (a configuration sketch follows the list below). As a result, it can provide:
Seamless large-scale data access, removing the need for time-consuming data transfers
A unified storage layer, eliminating data silos and reducing duplication costs
High-performance AI pipelines that ensure data is always available the moment it’s needed
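Here is a hedged sketch of what pointing the Model Streamer at a VAST S3 endpoint could look like. The endpoint URL, bucket, credentials, and the `AWS_*` and `RUNAI_STREAMER_CONCURRENCY` environment variable names are assumptions drawn from standard AWS SDK conventions and the Model Streamer documentation, not details published with the benchmark above:

```python
import os
from runai_model_streamer import SafetensorsStreamer

# Assumed S3 client settings; substitute your VAST S3 endpoint and credentials.
os.environ["AWS_ENDPOINT_URL"] = "https://vast-s3.example.internal"
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-key>"

# Number of concurrent streaming threads (the variable on the x-axis of
# Figure 1); more threads generally drive more of the available S3 bandwidth.
os.environ["RUNAI_STREAMER_CONCURRENCY"] = "16"

with SafetensorsStreamer() as streamer:
    streamer.stream_file("s3://llm-models/meta-llama-3-8b/model-00001-of-00004.safetensors")
    for name, tensor in streamer.get_tensors():
        pass  # hand each tensor to the model as it arrives
```

Inference servers that embed the streamer expose the same behavior through a loader option (for example, vLLM's `--load-format runai_streamer` flag), so a model can be served straight from the object store; check your serving stack's documentation for the exact switch.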
By collapsing the traditional storage-compute gap, VAST Data with NVIDIA Run:ai ensures that AI models no longer wait for data. Instead, they are immediately available on the VAST Data Platform at the speed of execution.
Bringing It All Together: AI at the Speed of Thought
VAST Data with NVIDIA Run:ai doesn’t just optimize AI model deployment—it redefines how enterprises approach AI latency, storage, and computation. By challenging the assumption that AI requires wait times, the solution enables:
Accelerated AI model availability through streaming, eliminating warm-up times
Real-time data access at local SSD speeds, removing storage bottlenecks
Parallelized AI execution, ensuring no computing cycles are wasted
Final Thoughts
For years, delays caused by data movement and access have been treated as a normal part of AI, but it doesn't have to be this way. The future of AI lies in real-time processing, instant model availability, and seamless parallel execution.
The takeaway for CTOs and AI buyers is clear: if your AI infrastructure is still making workloads wait, it’s time to rethink your strategy. Cloud and managed service providers (CSPs and MSPs) now have access to a powerful global-namespace object store that delivers unprecedented flexibility and performance as customers incorporate AI workflows into their core operations.
The future of AI isn’t about waiting; it’s about action. And with VAST Data with NVIDIA Run:ai, that future is already here.
Want to learn more? Join us at NVIDIA GTC where Glenn Lockwood and other industry thought leaders will unpack this topic and more in greater detail. Add it to your GTC agenda here.