product

Feb 20, 2024

Accelerate Your Trino Workloads with VAST

Authored by

Christian Neundorf, Senior Systems Engineer

The VAST Data Platform revolutionizes data infrastructure with a scalable software solution that unifies storage, database, and compute services to power data-intensive AI applications.

The platform integrates unstructured data management with a structured data environment that allows organizations to activate all data and to generate superior insights and unlock new value. In this blog we’ll discuss how the VAST DataBase provides structure to unstructured data, and we’ll demo how the VAST DataBase works with the Trino query engine to deliver drastic performance improvements.

VAST DataBase: The Best of All Worlds

The VAST DataBase serves as both a transactional and analytical database and targets performance, scalability, and cost efficiency for tabular data at any scale. With the performance of a data warehouse and the scale and cost efficiency of a data lakehouse, the VAST DataBase is key to accelerating time-to-value for structured data.

Beyond its core advantages, the VAST DataBase is feature-rich, supporting:

schema evolution
data protection and versioning via snapshots
efficient data management
easily access and delete individual records to ensure compliance (e.g. GDPR)
multi-table transactions

The VAST DataBase also powers the VAST Catalog for comprehensive data discovery and advanced insights. Ultimately, the VAST DataBase - by being part of the same platform that includes the VAST DataStore (for unstructured data storage) - enables data engineers to simplify their data pipelines, which accelerates the time-to-value for data by reducing replication, the number of pipelines, and cognitive load on teams.

VAST + Trino: Better Together

The VAST DataBase can be queried by the most popular engines, such as Spark or Trino. Trino is an open-source distributed SQL query engine designed for high-performance interactive analytics on various data sources. Trino can scale out horizontally, allowing organizations to distribute query processing across a cluster of machines. This scalability is crucial when dealing with large volumes of data typically found in data lakes. By distributing the workload, Trino can handle complex queries on massive datasets efficiently.

The VAST DataStore can also be queried from Trino (or Starburst, an enterprise version of Trino) using the Hive connector and the S3a/S3 protocol to access files/objects (csv, parquet, orc, json..) directly. VAST customers are already gaining a huge performance benefit over traditional data lake stores using HDFS or disk-based object stores.

However, to get an even more significant performance improvement, try ingesting the data into the VAST DataBase and use the VAST Trino connector to query the data. VAST provides a parallel connector for Trino and Starburst with the support for query pushdowns for efficient data transfer and query acceleration. Trino is well-regarded as a top of the line, performance-focused query engine. Coupling Trino with the VAST DataBase compounds the best of both worlds.

The VAST Advantage

Using the VAST DataBase connector with Trino provides the following advantages:

Faster query response times support faster decision-making
Increased concurrency, beating query response time SLAs
Cost optimization and reduction in footprint of the compute infrastructure

The VAST Trino connector uses Trino's DML to insert, update, and delete rows and columns. The VAST DataBase stores and updates data in columns in small 32KB chunks. This provides radical acceleration versus standard Parquet files (128MB - 1GB) in an S3 bucket. This makes the VAST DataBase a great fit for streaming use cases.

The below video walks through an example, and we also describe the use case below as well.

Loading Data into the VAST DataBase

In this video example, we have some parquet files that contain NYC taxi data stored in an S3 bucket on the VAST DataStore. To make the data available on S3/S3a for Trino, we would have to define a schema on read in our Hive metastore (i.e. s3 location for our parquet files, column names, data types, etc.). As soon as we have the schema defined, we can start running SQL queries on the files directly from Trino. This way we can quickly explore different file types like json, csv, parquet, and orc.

As you can see, while Trino directly on S3 works from a functional perspective, it requires Parquet file and metastore management and lacks the performance acceleration and features of a full-fledged database environment. Using the VAST DB takes advantage of additional performance optimizations to accelerate queries significantly, producing faster time-to-value for structured data. To show this improvement in query response time for our data consumers, we will import the data into the VAST Database. This is a simple operation that will have huge performance benefits.

Start by creating a schema where we want to store our tables:

We will increase the number of writer tasks to benefit from the parallel VAST DB connector and to improve the number of row inserts:

The following CTAS statement will create a VAST DataBase table in our nyc_taxi schema and we will read and copy the data directly from the VAST DataStore.

Tip: The same import procedure can be used to get data from other Trino supported data sources like RDBMS, NOSQL into the VAST DataBase.

In our lab environment we have 2x Trino/Starburst worker nodes connected to an entry-level VAST system. With this setup, we are able to ingest around 19 million rows per second.

Accelerating Query Performance with the VAST DataBase

We executed the following query on the same table (1.4 billion rows) to compare the differences between the VAST S3 DataStore and the VAST DataBase.

VAST Datastore (S3):

On the VAST S3 Datastore we had to scan the full 1.4 billion rows and around 30GB per dataset.

CPU Time: 1m 24s

VAST DataBase:

On the VAST DataBase we only had to scan 12.5 million rows and around 60 MB as we were able to push down the filters into the database. In other words, the query execution was faster on the VAST DataBase as we had to transfer 500x less data back to Trino, so there was significantly less CPU time required.

CPU time: 56 seconds

We will provide another blog around benchmarking on larger datasets and comparison with other systems which will highlight the performance advantages even more.

Summary

With its data warehouse-like feature set, fast ingestion, and accelerated read performance, the VAST DataBase is enabling data pros to rethink how structured data can be queried at scale. With a focus on cost efficiency and scalability, the VAST DataBase lowers TCO for data by requiring even less compute than today’s best query engines alone. The combination of Trino and VAST can provide organizations with a powerful platform for performing analytics on large and varied datasets up to exabytes, and reduce the overall cost per query significantly.

Interested in hearing more? Contact us!