A Checkpoint on Checkpoints in LLMs

Author

Subramanian Kartik, PhD, Global Systems Engineering Lead and Colleen Tartow, PhD, Field CTO and Head of Strategy

If you’ve been following along with deep learning trends, you’re aware that the data and models in deep learning are big - like, really big. The datasets could be on the order of petabytes in size, and the models themselves are also hundreds of gigabytes in size. This means that even the model itself won’t fit in memory on a standard GPU chip. Efficient and intelligent parallelization, as well as recoverability, is paramount in the world of deep learning.

Some recent literature has focused on infrastructure provisioning for systems focused on LLMs. If you know VAST, you know this is right up our alley, as we love to talk about not just data infrastructure but all of the amazing ways you can push technology to get the most value out of your data. So without further ado, we’d like to give you our take on how parallelism affects checkpoint and restore operations in the most complex models today.

Parallelizing in All Dimensions

For the large datasets and large models inherent in LLMs and other massive-scale deep learning algorithms, both the data and the models themselves are too large to fit into memory. As an example, a typical LLM with its billions of parameters will not fit in memory. GPT-3 is >500 GB in size, and a typical GPU is limited to 80 GB of VMEM. Furthermore, a single A100 GPU will take several hundred years to train GPT-3 (300+ years actually). Therefore, multi-dimensional parallelism is critical to training and fine-tuning models.

This argument is based on extensive research done in the field, in particular the seminal paper Large-Scale Training with Megatron-LM by Stanford, NVIDIA, and Microsoft Research. The authors suggest - and it has been confirmed in the field - that a synthesis of three types of parallelism allows for a much more manageable and recoverable workload in LLMs:

Data parallelism: The entire model is replicated across multiple GPUs or CPUs and distributes the training data among them. This is the simplest and most common type of parallelism, but for large models it is typically extremely memory-intensive.
Model parallelism: The model itself is sharded into discrete layers, or tensors, and then distributed across multiple GPUs or CPUs. This can be rather complex to implement, but is more memory-efficient than data parallelism.
Pipeline parallelism: The model training process is split into smaller steps and executed on different GPUs or CPUs. This can increase latency or effectively serialize the model, but can improve training throughput if done well.

By combining the three main types of parallelism, the overall model training performance can be increased by orders of magnitude.

Checkpoints and Recoverability

Once a model is parallelized it still may take a month or more to fully run a training job. Therefore, recoverability of that model’s execution becomes a critical consideration, and regular checkpoints should be taken of the state of the system. Typically, checkpointing is done after every epoch of training (that is, one complete pass through the training dataset). For most LLMs, each token is seen only once by the training process - so training happens in roughly 1 epoch - so checkpoints are done after a certain set number of training steps. Then, for example, if a model fails midway through the training process, rather than lose the progress to that point it is far preferable to restore to a checkpoint.

Alternatively, sometimes it’s necessary to go back and change parameters midway through a model’s progress. Checkpointing would allow for that change without requiring a full model run from the start. Moreover, checkpointing is essential to the repeatability of models; being able to re-run a model training to show compliance or reproduce results reliably will no doubt become more important as large-scale models continue to become more mainstream.

For that reason, it is incredibly important to ensure an AI architecture is designed to adequately allow for checkpoint operations. Note that AI models themselves aren’t typically I/O-bound, they remain GPU-bound. The checkpoint process, however, is write-intensive, and the restore process is even more read-intensive. In reality, the I/O requirements for AI architectures are directly related to checkpoint and restore operations.

Checkpoint sizing: let’s talk numbers

Sizing of infrastructure has been a frequent conversation topic in recent years with the onslaught of AI-focused technologies being released to the public. As such, GPT-3 is the perfect example to use when discussing how to adequately provision infrastructure for deep learning generally, and LLMs in particular. Let’s walk through this example, doing the math on how to size such a system:

GPT-3 has 175 billion parameters, and let’s assume that we’ll use all three types of parallelism discussed above in our deployment of this LLM, and that we’ll deploy it on 1024 GPUs, which is the equivalent of 128 NVIDIA DGX-A100 machines (each contains 8 GPUs).
As discussed above, the ~500 GB model itself is too big to fit into the 80 GB memory of a single GPU, so we’ll use tensor-based model parallelism to distribute the model across the 8 GPUs in a DGX node.
To implement pipeline parallelism, we’ll replicate the model across groups of 8 DGXs, or octets. Note that the Pipeline Parallel octet described here is self-contained and includes the entire model and pipeline for an LLM, and hence only one octet needs to be checkpointed per system, regardless of the overall size of the cluster because it is a complete representation of the system.
Each GPU therefore is provisioned as 1 thread, 8 threads per 8-GPU DGX, and 64 threads per 8-DGX octet. This results in buffered sequential large-block write operations writing to 1 checkpoint file per thread.
- Note: This is often a point of confusion as a common erroneous assumption is that every GPU in the cluster needs to be checkpointed. Not so - as shown there is one checkpointed octet in this example system.
Lastly we parallelize the data into 16 groups of 8 pipeline-parallel systems (128 nodes).

The combination of tensor model, pipeline and data parallelism allows linear scale in the Megatron pattern up to 1 Trillion parameter size models, with flop efficiency close to 50% of theoretical.

The bottom line of this example is that the size of a checkpoint doesn’t depend at all on the size of the data or the number of GPUs, only on the size of the model. When provisioning infrastructure for LLMs, the only consideration regarding checkpointing is to understand the size of the models you’ll be deploying, and then to make sure there’s enough bandwidth to write checkpoints and read a checkpoint as the I/O bounds.

An even nerdier deeper dive into checkpoint math

When you’re a secret academic nerd like us (seriously, we both have PhDs in Physics), it’s fun to dig in even deeper to understand the science behind how to checkpoint and restore LLM training systems, and of course we want to bring some data to the discussion. The table below shows a linear correlation between model size and checkpoint state size for three popular LLMs:

Note: GPT is a class of models, and the numbers above are from the academic paper referenced above.

Clearly the checkpoint size is directly correlated to model size, and it turns out the requirements are a straightforward calculation in the end. This checkpoint state size is dependent only on model size – not the size of the dataset, checkpoint frequency, or number of GPUs to train on.

Detailed requirements and timing for a checkpoint operation

Let’s return to the paper we referenced above, which shows an example of training with 1 trillion parameters with 3072 GPUs (384 DGXs):

When reading a checkpoint for restoring the system to a prior state, the system is rate limited as the storage parallel file system could only deliver 1 TB/s of read bandwidth. The write performance to checkpoint 13.8 TB, however, was clocked at 273 GB/s; the write process could only reach 40% of the available write bandwidth from the storage subsystem. This is limited because each DGX can only deliver around 8 GB/s as only 8 threads are used per DGX system. Incidentally, this is one of the tests we went through when receiving our NVIDIA SuperPOD certification – Buffered Writes vs Thread Count. A measure of the local NVMe burst buffer yields this result as well.

How long a system takes to write a checkpoint is a key consideration when provisioning infrastructure. In this continued example, the required bandwidth for the 512 GPUs that participate in the checkpoint write process is 0.53 GB/s/GPU or roughly 4.3 GB/s/DGX – even for the largest model examined in the paper. This results in a checkpoint time of 50 seconds for GPT-3, which is completely reasonable. NVIDIA has recommended checkpointing every 4 hours – this would constitute 0.3% overhead due to checkpointing. NVIDIA guidance is that this overhead be < 5% or so, hence even lower write bandwidths are sufficient to be within this tolerance.

As a reference point, CoreWeave has peak checkpoint write rates of 50 GB per second.

How long does it take to run a restore operation?

The read bandwidth needed to restore will be the data parallelism (6 in this example) multiplied by the writes, so 6x273 GB/s= 1.638 TB/s. The write bandwidth is 17% of the read Bandwidth required for restore. As long as a storage subsystem can deliver 1.64 TB/s for reads and 280 GB/s for writes, the checkpoint time will be optimal – around 50 seconds for a 13.8 TB model state. This is an entirely reasonable time, and maybe even overkill, for recovering a massive model to a previous stable state.

The Checkpoint on Checkpoints

We've seen discussions in the market recently that states a system needs 1 GB/s per GPU for all GPUs in the cluster for checkpoints. As is hopefully clear now, that's not based on realistic calculations. As we've shown above, while each checkpoint may need ~1 GB/s/GPU, this is not required for every GPU that participates in the training, but only those in a pipeline parallel set. For the example of a trillion-parameter model, 273 GB/s of write bandwidth really translates to just 0.53 GB/s per GPU (since there are 512 GPUs running in parallel, or 64 8-GPU servers). Therefore, not every GPU needs to checkpoint at the same time or needs that full write bandwidth; this avoids any over-provisioning of storage hardware for such training clusters.

Remember that a checkpoint doesn’t need to reflect absolutely everything in the system; it just needs to be a coherent record of the state of a system from which you can restore that system. In other words, you need to take a snapshot of the model and the pipelines in a consistent state. It’s important to recognize that this doesn’t depend on the data, only the weights and biases present in the current state of the model – in other words, the size of the model itself.

Conclusion: Math and Science are Cool

In the end, our advice to you is this: be sure you’re relying on math and science to drive considerations when building out an AI system. Given the sizes and scales we’re talking about here, small mistakes can be incredibly costly. Be sure to rely on real technical knowledge and calculations rather than back-of-the-napkin math or biased guidelines with ulterior motives.

Interested in learning more or discussing this further? We’d love to hear from you! Contact us and we’ll be happy to help you build out the right-sized AI system that will help you scale your business and its data.