We’ve Got The Write Stuff… Baby

Author

Jeff Denworth, Co-Founder

Fun Fact: The original name of VAST Data was Random Reads. Terrible name, right? If history took another path, we wouldn’t be VASTronauts, we’d be Speed Readers? Hard pass.

Today, we’re talking about read/write balance.

For 6 years, the laws of performance amortization have allowed VAST systems to deliver more than enough read performance and more than enough write performance to applications that we power. Because we make all-flash ultimately affordable and help customers consolidate tiers of data infrastructure, the aggregate NVMe performance from VAST is more than what organizations would otherwise be able to squeeze out of competing, expensive (and therefore small), NVMe systems.

While our customers have gotten the performance that they’ve needed from VAST clusters, our systems have tilted more toward read bandwidth than write bandwidth in terms of I/O balance. When we started, we wanted to optimize first for the most popular workloads that our system was designed for: analytics pipelines (20:1 R:W), render farms (5:1 R:W), AI computer vision (100:1 R:W), HPC (3:1 R:W), etc. Essentially every application except data protection has had a much larger read I/O requirement than write I/O requirement - and backup systems don’t get close to saturating VAST clusters.

Fast forward to 2024, where the parameter space has changed. Deep Learning supercomputers are now getting spectacularly large, and the need to checkpoint AI applications quickly is growing disproportionately to the growing size of training datasets. To give some perspective - whereas a big AI supercomputer in 2022 was built from 5,000 GPUs, generally intelligent foundational models will soon be trained on systems that have over 100,000 GPUs in a single cluster.

Application-wise: as training jobs start to scale into the 1,000s of GPUs, an AI cluster’s mean time between failures (MTBF) becomes a critical aspect of system and application design. Every parallel training application requires each cluster component to work flawlessly during a training job… which is an unrealistic expectation given the complexity, power, heat and software challenges of these large systems. These training jobs can run for days or even months in certain cases, so customers introduce defensive I/O techniques to preserve application state and make it possible to roll back to a consistent application checkpoint. Checkpoint frequency then rises with the scale of the training job, because more components mean more frequent failures (a lower MTBF). In summary, the bigger the training job, the more it needs to write.
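To put rough numbers on that relationship, here is a minimal sketch. It assumes component failures are independent (so the cluster's MTBF is the per-component MTBF divided by the component count) and uses Young's first-order approximation for the checkpoint interval, sqrt(2 x checkpoint cost x MTBF). The per-GPU MTBF and checkpoint cost below are hypothetical placeholders, not VAST or customer measurements.

```python
import math


def cluster_mtbf_hours(component_mtbf_hours: float, num_components: int) -> float:
    """MTBF of a job that fails when any one of N independent components fails."""
    return component_mtbf_hours / num_components


def young_checkpoint_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's approximation; reasonable while checkpoint cost is much smaller than MTBF."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


# Hypothetical inputs, chosen only for illustration.
GPU_MTBF_HOURS = 50_000.0
CHECKPOINT_COST_HOURS = 1.0 / 60.0  # one minute of stalled training per checkpoint

for gpus in (1_000, 10_000, 100_000):
    mtbf = cluster_mtbf_hours(GPU_MTBF_HOURS, gpus)
    interval = young_checkpoint_interval_hours(CHECKPOINT_COST_HOURS, mtbf)
    print(f"{gpus:>7} GPUs: cluster MTBF {mtbf:7.2f} h, checkpoint every {interval * 60:6.1f} min")
```

Under these assumptions the suggested interval shrinks with the square root of the GPU count, so a 100x larger job checkpoints roughly 10x as often.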

Do 100,000 GPU clusters see 100,000 GPUs writing checkpoint data all at the same time? No. AI checkpointing looks nothing like HPC checkpointing, as my colleagues Dr. Kartik and Dr. Colleen explored in a recent blog about AI checkpointing. Despite the fear-mongering of HPC storage companies, the workloads are actually quite reasonable. Microsoft, for example, has been busy helping popularize new checkpoint methods where GPU machines first checkpoint their model’s weights to a peer machine’s memory (saving time vs writing to disk) and then asynchronously drain checkpoint data down to shared storage. While writes to disk aren’t in the application path, they still need to be ingested by data infrastructure with poise.
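The shape of that pattern can be shown with a toy, single-process sketch. This is not Microsoft's implementation: an in-memory queue stands in for a peer machine's memory, a dictionary stands in for shared storage, and all names are hypothetical.

```python
import queue
import threading


def drain(peer_memory: queue.Queue, shared_storage: dict) -> None:
    """Background thread: move checkpoints from memory down to shared storage."""
    while True:
        item = peer_memory.get()
        if item is None:  # sentinel: training has finished
            return
        step, weights = item
        shared_storage[step] = weights  # stands in for a slower storage write


def train(num_steps: int, checkpoint_every: int) -> dict:
    """Run a fake training loop that checkpoints without waiting on storage."""
    peer_memory: queue.Queue = queue.Queue()
    shared_storage: dict = {}
    drainer = threading.Thread(target=drain, args=(peer_memory, shared_storage))
    drainer.start()

    weights = 0
    for step in range(1, num_steps + 1):
        weights += 1  # stands in for one training step
        if step % checkpoint_every == 0:
            # Fast hand-off to memory; the training loop continues immediately.
            peer_memory.put((step, weights))

    peer_memory.put(None)  # tell the drainer to stop once the queue is empty
    drainer.join()
    return shared_storage


if __name__ == "__main__":
    print(sorted(train(num_steps=10, checkpoint_every=5)))
```

The training loop only pays for the in-memory hand-off, while the storage system still has to absorb every checkpoint in the background.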

OK, so what’s the point?

Large AI supercomputers are starting to shift the balance of read/write I/O, and we at VAST want to evolve with them. Today, we’re announcing two new software advancements that will serve to make every VAST cluster even faster for write-intensive operations.

Introducing: SCM RAID

To date, all writes into a VAST system have been mirrored into storage class memory devices (SCM). Mirroring data is not as performance-efficient as using erasure codes, since with mirroring every I/O takes 2 drives to complete. Starting in 5.1 (available in April 2024), we’re proud to announce that the write path will now be accelerated by RAIDing data as it flows into the system’s write buffer. This simple software update will introduce a performance increase of 50%. Without any HW-based acceleration, each VAST cluster will automatically start writing data 50% faster. Erasure codes also help increase the resilience of the write buffers in large systems, since you’ll have 2x the number of protection drives (going from 1+1 in a mirror to 6+2 RAID).
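One back-of-the-envelope way to see where a figure like 50% could come from is to compare the raw bytes each scheme lands on SCM per byte of user data. This sketch assumes the write buffer is limited purely by raw SCM write bandwidth, which is a simplification of ours and not a derivation given in this post.

```python
def raw_bytes_per_user_byte(data_strips: int, protection_strips: int) -> float:
    """Raw bytes written to drives for each byte of user data."""
    return (data_strips + protection_strips) / data_strips


mirror = raw_bytes_per_user_byte(data_strips=1, protection_strips=1)  # 1+1 mirror
raid = raw_bytes_per_user_byte(data_strips=6, protection_strips=2)    # 6+2 RAID

# With the same raw drive bandwidth, user-visible write throughput scales
# inversely with the raw bytes written per user byte.
speedup = mirror / raid

print(f"mirror overhead: {mirror:.2f}x, 6+2 overhead: {raid:.2f}x")
print(f"relative write throughput: {speedup:.2f}x")
```

Mirroring writes 2 raw bytes per user byte while 6+2 writes 8/6, so the same drives can absorb about 1.5x as much user data under that assumption.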

Introducing: Spillover

VAST systems are typically configured with 3x more flash drives (today, QLC) than SCM drives. Our write buffers have been limited to the SCM devices in the system. SCM has the benefit of allowing our system to buffer writes without worrying about I/O alignment (something that must be optimized for QLC drives, which have low endurance and large erase blocks) or data reduction (you don’t want to write data to QLC and then write the reduced data post-process, since that just unnecessarily wears down low-endurance flash). But what about short-lived data, such as AI application checkpoints? These large checkpoint files can be written in full erase blocks down to flash, and they don’t get retained in the same way that application data does - therefore they don’t require post-process data reduction. So, let’s accelerate this with a new approach.

Later this summer (2024), version 5.2 of VAST OS will support a new mode where large checkpoint writes will spill over to also write directly into QLC flash. This method intelligently detects when the system is being written to heavily and allows large, transient writes to spill over into QLC flash. The net result is a system architecture that brings all of the SSD performance to bear when dealing with high-performance checkpoint applications while further shifting the balance of read/write I/O more in favor of defensive I/O.
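As a purely illustrative sketch of that decision (not VAST OS code), a spillover policy could be expressed as a routing function. The erase-block size, the utilization threshold, and every name below are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical tuning values, chosen only for illustration.
ERASE_BLOCK_BYTES = 64 * 1024 * 1024
BUSY_UTILIZATION = 0.8


@dataclass
class Write:
    size_bytes: int
    transient: bool  # e.g. a checkpoint that will soon be overwritten or deleted


def choose_target(write: Write, scm_buffer_utilization: float) -> str:
    """Return "qlc" for large, transient writes while the SCM buffer is busy."""
    large = write.size_bytes >= ERASE_BLOCK_BYTES
    busy = scm_buffer_utilization >= BUSY_UTILIZATION
    if large and write.transient and busy:
        return "qlc"  # spill over: written in full erase blocks, no post-process reduction
    return "scm"      # default path: buffer in storage class memory


checkpoint = Write(size_bytes=4 * ERASE_BLOCK_BYTES, transient=True)
small_update = Write(size_bytes=4096, transient=False)

print(choose_target(checkpoint, scm_buffer_utilization=0.9))
print(choose_target(checkpoint, scm_buffer_utilization=0.2))
print(choose_target(small_update, scm_buffer_utilization=0.9))
```

Only writes that are large, short-lived, and arriving under heavy write load bypass the SCM buffer in this sketch; everything else keeps the existing path.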

This new approach is thoughtful and will not jeopardize our ability to ensure a decade of QLC SSD endurance, a commitment we make to customers that allows us to remove the guesswork from building low-cost NVMe infrastructure.

AI Acceleration

When considering how we configure these systems for large AI computers, write-intensive configurations will see a 62% reduction in required hardware within 6 months’ time. We’re now guiding customers building large GPU infrastructure that minimally configured systems can be even more affordable. With VAST’s data reduction, we preserve our ability to deliver big systems in small form factors. Our customers who do LLM training typically see 2:1 data reduction for training datasets - VAST OS’s innovations in system efficiency realize tremendous cost, power and space savings when it comes to data capacity.

If we take our performance sizing for NVIDIA DGX SuperPOD as a benchmark, our write flow optimizations dramatically reduce the amount of hardware needed for a single Scalable Unit (SU).

The Write Stuff

It’s been easy for VAST competitors to mischaracterize our read/write balance as a deficiency of our architecture. What these companies haven’t realized is that our architecture is far more powerful and flexible than they fear, and we still have a ton of optimization left to do. In 5.0, for example, we introduced 11 performance accelerations that speed up certain I/O by as much as 50%. SCM RAID is just one example of what we have on tap for 5.1. Our engineers are optimizing new aspects of our code daily, and there’s still so much more we look forward to showing you.

So bring on the hyperscale AI workloads!... and we’ll gobble up that data 230% faster so that GPU clusters can focus their might on pioneering new frontiers of AI-powered discovery.

Thx. - Jeff
