Techniques that once drove HPC performance—like extreme bandwidth at any cost—can now work against AI projects that need to scale efficiently. Below is a structured look at key steps that help leaders in AI infrastructure design balance power considerations, capacity, and the performance needs of modern AI clusters.
1. Power is Scarce, Performance is Not
Traditional HPC systems focused on raw bandwidth, but today’s AI clusters face a new bottleneck: power. Training clusters with thousands of GPUs can quickly reach power draws in the megawatt range. Every watt that goes to spinning disks or smaller SSDs is a watt you cannot allocate to compute. Data centers often operate under strict power caps, and expanding grid capacity can take years. By accepting that power is finite, you start to look for storage solutions that minimize energy overhead while still delivering the necessary I/O speed for AI.
2. Calculate the True Cost of Storage Overkill
HPC configurations often revolve around maximizing performance per terabyte, which may involve deploying vast numbers of smaller SSDs. This approach might look appealing on a specification sheet, but in a multi-petabyte or exabyte environment, each drive draws additional power around the clock. For instance, 100,000 extra SSDs in a two-exabyte system can add multiple megawatts of power usage. That is enough energy to run over a thousand GPUs instead. When you weigh these tradeoffs, the cost of HPC-style overkill becomes impossible to ignore.
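The arithmetic is easy to sketch. The figures below (drive capacities, ~20 W per enterprise SSD, ~1 kW per accelerator-class GPU) are illustrative assumptions, not vendor specifications, but they show how the drive count drives the power bill:

```python
import math

# Back-of-envelope comparison: building 2 EB of raw flash from small vs. large SSDs.
# Per-drive wattage and GPU wattage are illustrative assumptions.

def fleet_power_watts(capacity_tb: float, drive_tb: float, watts_per_drive: float) -> float:
    """Power draw of the SSD fleet needed to reach the target raw capacity."""
    drives = math.ceil(capacity_tb / drive_tb)
    return drives * watts_per_drive

TARGET_TB = 2_000_000  # two exabytes, expressed in terabytes

small = fleet_power_watts(TARGET_TB, drive_tb=15, watts_per_drive=20)    # ~133k drives
large = fleet_power_watts(TARGET_TB, drive_tb=120, watts_per_drive=20)   # ~17k drives

saved_watts = small - large
print(f"Small-drive fleet: {small / 1e6:.2f} MW")
print(f"Large-drive fleet: {large / 1e6:.2f} MW")
print(f"Savings: {saved_watts / 1e6:.2f} MW ~= {saved_watts / 1000:.0f} GPUs at ~1 kW each")
```

With these assumptions the small-drive fleet burns roughly 2.3 MW more than the large-drive fleet, which is indeed enough headroom for well over a thousand GPUs.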
3. Remember That Capacity Delivers Parallel Performance
Large-format SSDs, in the 60TB–120TB range, bring a surprising amount of performance to AI workloads. If you distribute data across thousands of these higher-capacity drives, your overall throughput can be enormous. This parallelism means that even demanding tasks like checkpointing multi-terabyte models (common in large language model training) can finish on schedule. The real benefit is that you achieve strong I/O performance while minimizing the number of drives—and thereby your power draw.
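A simple model makes the parallelism concrete. The per-drive write speed (~3 GB/s), the 50% filesystem efficiency factor, and the checkpoint size below are illustrative assumptions, not measurements:

```python
# Rough model of checkpoint time as a function of fleet-wide parallel write
# throughput, assuming writes stripe evenly across all drives.

def checkpoint_seconds(checkpoint_tb: float, drives: int, gbps_per_drive: float,
                       efficiency: float = 0.5) -> float:
    """Time to land a checkpoint, if the filesystem sustains `efficiency`
    of the raw aggregate bandwidth across `drives` SSDs."""
    aggregate_gbps = drives * gbps_per_drive * efficiency
    return (checkpoint_tb * 1000) / aggregate_gbps

# A 10 TB checkpoint striped across 2,000 large-format SSDs:
t = checkpoint_seconds(checkpoint_tb=10, drives=2000, gbps_per_drive=3.0)
print(f"~{t:.1f} s")  # ~3 TB/s usable aggregate -> a few seconds
```

Even with conservative efficiency assumptions, a few thousand drives deliver terabytes per second of aggregate bandwidth, so multi-terabyte checkpoints complete in seconds rather than minutes.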
4. Be Wary of Tiered Storage and Manual Movement
HPC best practices might suggest a separate “scratch” layer to capture temporary high-velocity data. That can be helpful when your workloads revolve around large, carefully scheduled batch jobs. AI training, however, is more dynamic and often demands rapid iteration. Adding another layer of data management can introduce delays and complexity. Single-tier, all-flash systems can eliminate the burden of moving data back and forth between tiers. This simpler design also consumes fewer resources overall, since you don’t need to power entire arrays dedicated to short-term storage.
5. Optimize Checkpointing with Asynchronous Writes
Large AI models need consistent checkpointing to safeguard progress, which can create bursts of high I/O. A well-designed storage system with many parallel SSDs can handle these writes without slowing the GPUs. The industry has increasingly shifted to asynchronous checkpointing, in which the compute cluster continues training while locally staged checkpoint data is copied to remote storage in the background. As a result, most modern AI infrastructure no longer needs a separate high-performance scratch storage system, shrinking the hardware footprint and saving both power and space.
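The pattern can be sketched in a few lines. This is a minimal illustration, not production code: the paths and save format are placeholders, and real frameworks handle sharding, consistency, and failure recovery for you.

```python
# Minimal sketch of asynchronous checkpointing: the training loop does a fast
# blocking write to local staging, then a background thread copies the file to
# "remote" storage while training continues.
import os
import shutil
import tempfile
import threading

def checkpoint_async(state: bytes, step: int, remote_dir: str) -> threading.Thread:
    # 1. Fast local write: this is the only part that stalls the training loop.
    local_path = os.path.join(tempfile.gettempdir(), f"ckpt_{step}.bin")
    with open(local_path, "wb") as f:
        f.write(state)

    # 2. Slow copy to remote storage runs off the training thread.
    def upload():
        shutil.copy(local_path, os.path.join(remote_dir, f"ckpt_{step}.bin"))
        os.remove(local_path)

    t = threading.Thread(target=upload, daemon=True)
    t.start()
    return t  # training resumes immediately; join() only at shutdown

remote = tempfile.mkdtemp()
handle = checkpoint_async(b"model-weights", step=100, remote_dir=remote)
handle.join()  # a real training loop would not block here
print(sorted(os.listdir(remote)))
```

Because the GPU-facing write lands on local media and the transfer to shared storage happens in the background, the remote system only needs enough bandwidth to keep up on average, not to absorb every burst.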
These steps illustrate why HPC-era approaches can hamper AI progress at very large scales. By accepting that power is finite and evaluating storage choices in light of that, you gain flexibility for where it matters most—training. Focusing on fewer, higher-capacity drives and streamlined design helps AI initiatives avoid bottlenecks and reach model objectives more efficiently. The opportunity is significant, and it begins with reimagining how storage fits into a next-generation AI data center.
What are we missing? What are your experiences designing storage for HPC and AI environments? Join the conversation on Cosmos, the VAST community for AI professionals and innovators to share insights and build lasting partnerships across the global ecosystem.