At GTC 2025, NVIDIA’s Head of Research and Chief Scientist, Bill Dally, gave a characteristically dense and idea-rich talk—a rapid-fire tour through years of deeply technical work that culminated in a few core breakthroughs worth paying attention to.
While the entire session was worth the price of admission for anyone watching the evolution of GPU architecture and AI infrastructure, there was one thread in particular that should jump out at those of us thinking about data at scale: interconnects.
Dally didn’t just showcase faster chips or bigger models—he walked us through the often-overlooked plumbing that makes modern AI even possible.
In his view, and increasingly in ours, the real limiting factor in accelerated computing isn’t how fast you can multiply matrices. It’s how efficiently you can move data. “Most of the energy in a GPU,” he said, “isn’t spent on arithmetic—it’s spent moving data around.”
This shift in perspective is more than philosophical—it’s architectural. NVIDIA’s internal solutions are stunning in their own right: chip-to-chip connections with hybrid bonding pitches down to 9 microns, interposer-level signaling at 25–50 Gbps per pin, and ground-referenced signaling innovations that drastically cut energy use. The interconnect stack—from NVIDIA Grace to Blackwell to NVSwitch to rack-scale optical fabrics—is now a primary design point.
Which brings us right back to data: If NVIDIA is building one big accelerated compute stack, VAST is here to feed it.
When Dally described the need for ultra-high-bandwidth, energy-efficient, hierarchically aware interconnects, he was talking about NVIDIA’s internal engineering goals. But he was also sketching a blueprint for what the rest of the system must become. Accelerated compute stacks can no longer be starved for data, and traditional tiered storage—with its latency cliffs and bandwidth mismatches—has no place in this world.
VAST solves this with an architecture that mirrors the core design principles Dally laid out:
Parallelism everywhere: just as NVIDIA spreads compute across multiple dies and packages, VAST ensures every GPU, CPU, and node sees a consistent, high-speed view of the data. And because data movement is expensive, VAST reduces both the movement and its cost by collapsing tiers, bringing hot and cold data together in a single, fast-access layer.
On the bottleneck front, much as NVIDIA's simultaneous bidirectional signaling keeps traffic flowing both ways at once, VAST ensures that reads and writes don't compete or degrade under pressure. The system stays fast in both directions, even as scale increases.
Of course, all of this becomes even more critical in AI inference—especially with the rise of token-by-token decoding in LLMs. Dally’s segment on RocketKV showed how NVIDIA trims memory bandwidth usage internally with smart cache pruning. VAST complements this by solving the external bandwidth problem: accelerating initial KV cache loads, supporting shared context across inference nodes, and eliminating delays between question and response.
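To make the KV-cache point concrete, here is a minimal sketch of the idea of sharing prefill work across inference nodes: compute a prompt's KV cache once, publish it to a shared, fast-access namespace, and let any node serving the same context load it instead of recomputing it. This is illustrative only, not a VAST or NVIDIA API; the names (compute_prefill_kv, CACHE_DIR, get_kv_cache) are hypothetical placeholders, and the prefill step is a stand-in for a real model's forward pass.

```python
# Hypothetical sketch: reuse a prompt's prefill KV cache across inference nodes
# by keying it on the prompt content and storing it on a shared namespace.
# Names and paths are placeholders, not a real VAST or NVIDIA API.
import hashlib
import os
import numpy as np

CACHE_DIR = "/mnt/shared-kv-cache"  # assumed shared, high-bandwidth mount

def compute_prefill_kv(prompt: str, n_layers: int = 4, n_heads: int = 8,
                       head_dim: int = 64) -> dict:
    """Stand-in for the expensive prefill pass that builds the KV cache."""
    rng = np.random.default_rng(len(prompt))
    seq_len = max(1, len(prompt.split()))
    shape = (n_layers, n_heads, seq_len, head_dim)
    return {
        "keys": rng.standard_normal(shape).astype(np.float32),
        "values": rng.standard_normal(shape).astype(np.float32),
    }

def cache_path(prompt: str) -> str:
    """Content-addressed location, so any node with the same prompt finds it."""
    digest = hashlib.sha256(prompt.encode()).hexdigest()
    return os.path.join(CACHE_DIR, f"{digest}.npz")

def get_kv_cache(prompt: str) -> dict:
    """Load a shared KV cache if another node already built it; else build and publish it."""
    path = cache_path(prompt)
    if os.path.exists(path):
        with np.load(path) as f:          # fast path: read instead of recompute
            return {name: f[name] for name in f.files}
    kv = compute_prefill_kv(prompt)       # slow path: run prefill once
    os.makedirs(CACHE_DIR, exist_ok=True)
    np.savez(path, **kv)                  # publish for other inference nodes
    return kv

if __name__ == "__main__":
    kv = get_kv_cache("Summarize the Q3 infrastructure report for the ops team")
    print(kv["keys"].shape)               # decode can begin from this cache immediately
```

The design choice worth noting is the content-addressed key: because the cache is named by a hash of the prompt, any node that receives the same context can find the prefill result without coordination, which is the kind of shared, low-latency access pattern the paragraph above describes.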
In this sense, VAST becomes more than a storage system. It’s the data substrate for accelerated computing. If NVIDIA accelerated computing is the brain, VAST is the circulatory system—designed not just for capacity, but for low-latency, high-volume, always-on delivery of intelligence-grade information.
Dally’s talk was a map. For those of us building the terrain around it, the message is clear: interconnects are the next frontier. VAST is already there.
As the AI bottleneck shifts from compute to data, the need for a new approach to data infrastructure has never been clearer. Join the discussion on Cosmos to share strategies and insights on overcoming I/O constraints and enabling fully saturated GPU pipelines.