product
Jul 23, 2024

Creating Robust Data Pipelines for AI with VAST Data

Posted by

Subramanian Kartik, PhD, Global Systems Engineering Lead and Dave Graham, Technical Marketing Manager

The general perception about artificial intelligence (AI) is that it’s all about GPUs, but that’s only a fraction of the story. Foundation model training is GPU intensive, of course, and, depending on the amount of data and the size of the model, it can consume thousands of GPUs and GPU hours as well as enormous amounts of power. But this is only one piece of the processing needed to build robust data pipelines and transform raw data into usable generative AI models.

Today, we’ll study the heavy lifting outside the GPU clusters: how raw data is processed, cleaned, and fed into the model training infrastructure, as well as the processing that follows. Each step is performed securely while keeping track of data provenance and governance.

Understanding Data Pipelines in AI

In AI, a data pipeline is the set of processes and transformations that data undergoes from its raw state to a refined form ready for training AI models, before moving on to fine-tuning, quantization, and inferencing, with or without RAG or RLHF. This article focuses on the data preparation and training phases.

This segment of the journey involves several stages, including data ingestion, cleaning, transformation, and tokenization, before the data is finally fed into the AI models. This may be done in batches (as in traditional extract, transform, load - or ETL - processing for databases), but data pipelines are typically streaming processes, not designed around a fixed point in time. Downstream of model training, other critical steps such as fine-tuning, quantization, and retrieval-augmented generation (RAG) are essential to reaching the point where the proverbial rubber meets the road: inferencing. All stages consume data, and all stages move data. As renowned Stanford Professor Andrew Ng states so aptly:

Instead of focusing on the code, companies should focus on developing systematic engineering practices for improving data in ways that are reliable, efficient, and systematic. In other words, companies need to move from a model-centric approach to a data-centric approach.

- Andrew Ng, CEO and Founder of Landing AI

Critical Components of a Data Pipeline

[Image: critical components of an AI data pipeline]

People often fixate on expensive GPUs for training, but it’s easy to forget that roughly 80% of training time is spent processing raw data from sources such as internal corporate data, internet data, GitHub, arXiv, and PubMed (among others). This process involves refining the data until the right content is in the right format for the model we want to train. All phases in the pipeline shown above consume and generate data, and moving this data takes a significant amount of time. VAST provides a common namespace for all phases in the data pipeline, eliminating the need to move data from one phase to another. This initial data preparation is highly iterative and often requires returning to previous steps for further refinement. Eliminating data movement makes model training more predictable, reducing overall training times and costs.

A Practical Example: Training GPT Models

To illustrate the effectiveness of the VAST Data Platform with AI data pipelines, consider the example of training large language models (LLMs) like the Generative Pre-trained Transformer (GPT). Training such models involves processing vast amounts of text data, requiring efficient data ingestion, cleaning, transformation, and storage. This pipeline was partly inspired by the RefinedWeb pipeline used to build the tokens for the TII Falcon-40B LLM, one of the most successful and influential models on Hugging Face.

In the example below, we will walk through a sample data pipeline and highlight the data transformation process.

[Image: a sample data pipeline for training GPT models]

A Step-by-Step Pipeline with VAST Data

1. Data Ingestion: Raw data from CommonCrawl, maintained on AWS, is ingested into VAST Data’s high-throughput storage layer using native S3 tools to ensure quick and efficient data collection. CommonCrawl is a long-running project that scrapes the Internet and curates raw HTML dumps of the pages it finds; it is the most common starting point for curating training datasets for LLMs (see the ingestion sketch after this list).

2. Data Cleaning: The raw data is then converted to Parquet format. The raw HTML is parsed to clean the content, leveraging tools like BeautifulSoup to extract useful training text from the markup scraped at ingestion. The VAST Data Platform persists intermediate states of this processing as records in the VAST DataBase, so iterative cleaning passes can pick up where the last one left off rather than starting from scratch. This processing uses Spark, leveraging the Spark Connector to connect to the VAST DataBase (see the cleaning and transformation sketch after this list).

3. Data Transformation: The data is transformed through several operations, such as removing blacklisted words, XML tags, cookies, and drop-down menus. Duplicate text records are also discarded. Finally, the language of each record is identified and stored as a separate column in the database, so selecting a specific language is as simple as applying a predicate.

4. Data Training: The final stage extracts the text records to train on and converts them into tokens. Here, tokenization and training for GPT models are done using the Megatron-LM repository from NVIDIA on GitHub (see the tokenization sketch below).
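
To make step 1 concrete, here is a minimal ingestion sketch that pulls a few WARC files from CommonCrawl’s public S3 bucket and lands them behind a VAST S3 endpoint with boto3. The endpoint URL, credentials, target bucket, and crawl ID are illustrative assumptions rather than details from the pipeline above.

```python
import gzip

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Public CommonCrawl bucket in us-east-1 (anonymous access).
cc = boto3.client("s3", region_name="us-east-1", config=Config(signature_version=UNSIGNED))

# Target: a VAST S3 endpoint and bucket -- names here are assumptions, adjust for your cluster.
vast = boto3.client(
    "s3",
    endpoint_url="https://vast.example.internal",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
)

CRAWL = "CC-MAIN-2024-26"  # example crawl ID

# Each crawl publishes a manifest listing every WARC file it contains.
manifest = cc.get_object(Bucket="commoncrawl", Key=f"crawl-data/{CRAWL}/warc.paths.gz")
warc_keys = gzip.decompress(manifest["Body"].read()).decode().splitlines()

# Copy a handful of WARC files into the VAST namespace for downstream cleaning.
for key in warc_keys[:4]:
    obj = cc.get_object(Bucket="commoncrawl", Key=key)
    vast.upload_fileobj(obj["Body"], "raw-commoncrawl", key)  # assumed target bucket name
```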
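
Steps 2 and 3 run on Spark against the VAST DataBase through the Spark Connector; the sketch below is a simplified stand-in that parses HTML with BeautifulSoup, applies blacklist and deduplication filters, tags each record with a language column, and persists the result as Parquet. The paths, column names, filter terms, and the placeholder language detector are assumptions for illustration only.

```python
from bs4 import BeautifulSoup
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("commoncrawl-cleaning").getOrCreate()

# Assumption: an earlier pass has already unpacked WARC records into Parquet rows
# of (url, raw_html) somewhere in the shared VAST namespace.
raw = spark.read.parquet("/vast/pipeline/raw_html/")

@F.udf(T.StringType())
def extract_text(html):
    """Strip scripts, styles, and navigation chrome; keep only visible text."""
    if html is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

@F.udf(T.StringType())
def detect_language(text):
    """Placeholder language ID -- swap in a real detector (fastText, langdetect, etc.)."""
    return "en"

BLACKLIST = ["cookie policy", "accept all cookies"]  # illustrative filter terms

cleaned = (
    raw.withColumn("text", extract_text("raw_html"))
       .filter(F.length("text") > 200)                       # drop near-empty pages
       .filter(~F.lower("text").rlike("|".join(BLACKLIST)))  # crude blacklist filter
       .dropDuplicates(["text"])                             # discard duplicate records
       .withColumn("language", detect_language("text"))      # language as its own column
)

# The post persists these intermediate states in the VAST DataBase via the Spark
# Connector; plain Parquet in the shared namespace stands in for that here.
cleaned.select("url", "language", "text").write.mode("overwrite").parquet("/vast/pipeline/cleaned/")
```

In the actual pipeline, writing these intermediate states back to the VAST DataBase is what lets a later cleaning pass resume from this point instead of re-parsing the raw HTML.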
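
Step 4 in the actual pipeline uses NVIDIA’s Megatron-LM tooling for tokenization and training. As a lightweight stand-in, the sketch below selects English records with a predicate on the language column and converts them to GPT-2 BPE tokens with the Hugging Face tokenizer; the dataset path is an assumption carried over from the sketch above.

```python
import pyarrow.dataset as ds
from transformers import GPT2TokenizerFast

# Read only the English records from the cleaned dataset -- the language column added
# in step 3 makes this a simple predicate.
english = ds.dataset("/vast/pipeline/cleaned/", format="parquet").to_table(
    columns=["text"],
    filter=ds.field("language") == "en",
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Convert each document into GPT-2 BPE token IDs with an end-of-text marker appended --
# roughly what Megatron-LM's preprocessing does before packing its binary training shards.
token_streams = [
    tokenizer.encode(doc) + [tokenizer.eos_token_id]
    for doc in english.column("text").to_pylist()
]

print(f"{len(token_streams)} documents, {sum(map(len, token_streams)):,} tokens")
```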

VAST's Approach to Data Pipeline Management

The VAST approach integrates several innovative features to address the unique challenges posed by AI data processing.

High-Performance Storage

The foundation of the VAST Data Platform is a single tier of flash infrastructure known as the VAST DataStore that handles the immense data throughput required for AI applications. It provides a global namespace for all data with support for multiple front-end protocols like NFS and GPUDirect to ensure seamless access and high-speed retrieval. The platform’s architecture allows efficient multiprotocol access for structured and unstructured data storage, making it ideal for diverse AI workloads.
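
To give a sense of what multiprotocol access to a single namespace looks like in practice, the hedged sketch below reads the same Parquet file once through an NFS mount and once through an S3 endpoint using PyArrow. The mount point, endpoint, bucket, file name, and credentials are all assumptions.

```python
import pyarrow.parquet as pq
from pyarrow import fs

# 1. Over NFS: the VAST file system mounted at a conventional path (illustrative mount
#    point and file name).
nfs_table = pq.read_table("/mnt/vast/pipeline/cleaned/part-00000.parquet")

# 2. Over S3: the same data reached through the platform's S3 endpoint (illustrative
#    endpoint, credentials, and bucket/key; the first path segment is the bucket).
s3 = fs.S3FileSystem(
    endpoint_override="https://vast.example.internal",
    access_key="<access-key>",
    secret_key="<secret-key>",
)
s3_table = pq.read_table("pipeline/cleaned/part-00000.parquet", filesystem=s3)

# One copy of the data, two access paths.
assert nfs_table.num_rows == s3_table.num_rows
```

Because both paths resolve to the same stored data, the pipeline never has to copy datasets between a file-oriented cleaning stage and an object-oriented ingestion stage.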

Advanced Data Reduction

One of the standout features of VAST’s solution is its Similarity data reduction. By employing fine-grained data deduplication, VAST can significantly reduce the storage footprint of AI datasets. This lessens storage costs and accelerates data processing by minimizing the data needed at each stage in the data pipeline. More information on VAST Data’s Similarity data reduction can be found here.

Seamless Scalability

AI projects often require scalability to accommodate growing datasets and increasing computational demands. VAST's Disaggregated, Shared Everything (DASE) architecture is built to scale effortlessly, allowing storage and processing capabilities to expand independently and eliminate performance, availability, and capacity constraints. AI models can then be trained on larger datasets without compromising performance.

Data Platform and Governance Capabilities

Most of the processing in this example is done by heavily leveraging the VAST Data Platform. This includes the VAST DataBase, Spark connectors, Apache Arrow support, and the VAST Python SDK, which provide a unique and flexible platform for rapid, iterative data pipeline development. The robust security features - such as multitenancy, encryption, full auditing, immutable snapshots, and metadata tagging - enable a compelling ecosystem for the generative AI world.
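
As a flavor of that iterative loop, the sketch below registers the cleaned records from the earlier example as a Spark table and runs the kind of predicate selection and quick audit queries that data preparation keeps cycling through. The path and table name are assumptions; in the actual pipeline the VAST DataBase Spark Connector plays this role.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-audit").getOrCreate()

# Assumption: the cleaned records are exposed to Spark as a table. In the actual pipeline
# the VAST DataBase Spark Connector does this; a Parquet path in the namespace stands in.
spark.read.parquet("/vast/pipeline/cleaned/").createOrReplaceTempView("cleaned_docs")

# Selecting a training language is just a predicate, as described in step 3 above.
english = spark.sql("SELECT url, text FROM cleaned_docs WHERE language = 'en'")

# A quick audit of what survived cleaning, per language.
spark.sql("SELECT language, COUNT(*) AS docs FROM cleaned_docs GROUP BY language").show()
```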

Closing the Loop

In the realm of AI, the quality and efficiency of data pipelines are crucial. Through every stage of the data pipeline, the VAST Data Platform offers high-performance storage capabilities, enabling seamless access and retrieval of the data essential for AI model development and training. Platform features like Similarity-based data reduction ensure optimal resource utilization for faster processing, maximizing the efficiency and speed of AI workflows across all stages of a data pipeline.

In addition, the VAST platform is characterized by seamless scalability through DASE, allowing for the effortless expansion of data infrastructure in alignment with the evolving needs and demands of AI projects. This scalability enables data scientists to manage and process exponentially growing datasets without compromising performance or reliability.

VAST also emphasizes enhanced data governance, ensuring that AI practitioners can maintain strict control and compliance over their data throughout the entire processing lifecycle. By adhering to robust data governance practices, VAST empowers organizations to uphold data integrity, security, and regulatory compliance, fostering trust and reliability in AI-driven decision-making.

In conclusion, whether immersed in developing LLMs or exploring other diverse AI applications, VAST offers an extensive suite of tools and capabilities that cater to the intricate requirements of AI development and implementation. This presents a compelling opportunity for those seeking success in their AI endeavors. At every stage of this pipeline, VAST keeps data, above all else, at the center of the AI ecosystem. To get a sense of how we can operationalize this for you, join our virtual Cosmos event on October 1st and 2nd, 2024 or catch us on our road tour. Register now and join the next great adventure in AI.
