October turned into a big news month here at VAST. First, we announced the next step in the evolution of the VAST Data Platform with an AI workflow automation system co-developed with NVIDIA (VAST InsightEngine with NVIDIA). We also introduced the Cosmos AI Community to give practitioners a place to make connections and talk shop. Then, the good folks at Gartner® named us a Leader in the Magic Quadrant™ for File and Object Storage Platforms.
While those of us in VAST marketing thought that was enough and we could rest for a minute, the boffins in engineering reminded us that we had a new version of VAST’s software (v5.2) with significant enhancements across the platform.
Bringing VAST to Hyperscale with EBoxes
The most visible new feature in VAST 5.2 is support for a new hardware configuration we’re calling the EBox, or Everything Box. Each x86 EBox runs a CNode container, which serves user requests and manages data just like a dedicated CNode would, and DNode containers, which connect the EBox’s SSDs to the cluster’s NVMe fabric. Just like in a VAST cluster with CBoxes and DBoxes, every CNode in the cluster mounts every SSD in the cluster.
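To make that shared-everything idea concrete, here’s a small conceptual sketch in Python (purely illustrative, not VAST code): every EBox contributes one CNode container and the SSDs behind its DNode containers, and the resulting cluster view gives every CNode a path to every SSD.

```python
# Conceptual model of the EBox layout (illustrative only, not VAST source code).
# Each EBox hosts one CNode container plus DNode containers that expose its
# SSDs; the cluster view gives every CNode a path to every SSD over NVMe.
from dataclasses import dataclass, field


@dataclass
class EBox:
    name: str
    ssds: list                      # SSDs exposed by this EBox's DNode containers
    cnode: str = field(init=False)

    def __post_init__(self):
        self.cnode = f"cnode-{self.name}"   # one CNode container per EBox


def cluster_view(eboxes):
    """Every CNode in the cluster mounts every SSD in the cluster."""
    all_ssds = [ssd for box in eboxes for ssd in box.ssds]
    return {box.cnode: all_ssds for box in eboxes}


boxes = [EBox(f"ebox{i}", [f"ebox{i}-ssd{j}" for j in range(4)]) for i in range(3)]
view = cluster_view(boxes)
assert all(len(ssds) == 12 for ssds in view.values())   # 3 EBoxes x 4 SSDs each
```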
The EBox architecture lets us run the VAST Data Platform in environments that, until now, couldn’t use, or didn’t want, highly available DBoxes. These include hyperscalers that standardize on thousands of servers in one very specific configuration and cloud providers that only offer virtual machine instances. It also allows us to work with companies like Supermicro and Cisco to deliver the VAST Data Platform to customers using servers from those vendors.
EBox clusters use the DBox-HA data layout, so a cluster of at least 11 EBoxes can keep running through EBox failures. We’re also bringing the EBox architecture to the public cloud in 5.2, with fully functional VAST Clusters on the Google Cloud Platform. We’ll have more details in an upcoming EBox blog post.
Doubling Write Performance, Again
As our fearless founder Jeff Denworth blogged in March, we’ve been paying special attention to write performance in the last few versions of the VAST software. In 5.1 we switched from mirroring the write buffer in SCM to double parity erasure codes. That change, along with a few other optimizations, just about doubled write performance.
In 5.2, we’re taking advantage of the fact that a cluster has many more capacity (QLC) SSDs than SCM SSDs by directing large bursts of writes, like AI training jobs dumping checkpoints, to a section of QLC.
Writing to the SCM and QLC in parallel approximately doubles write performance again, and since we’re only sending bursts of large writes to a small percentage of the QLC in a cluster, the flash wear impact is insignificant. Altogether, we’ve managed a 4X improvement in write performance, and since that all came from the software, every VAST customer will see a performance boost without waiting for the next hardware release.
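A rough sketch of that routing decision, with a made-up size threshold and placeholder helpers (not the actual VAST write path): small writes keep landing in the SCM write buffer, while large bursts are steered to the QLC slice, and both paths absorb writes in parallel.

```python
# Illustrative sketch of splitting the write path between the SCM write buffer
# and a small reserved slice of QLC. The 1 MiB threshold and helper names are
# assumptions for illustration, not VAST's actual write-path logic.
from concurrent.futures import ThreadPoolExecutor

LARGE_WRITE_THRESHOLD = 1 << 20   # assumed cutoff for a "large burst" write


def write_to_scm_buffer(chunk: bytes) -> None:
    ...   # stage small writes in the SCM write buffer (placeholder)


def write_to_qlc_burst_area(chunk: bytes) -> None:
    ...   # stream large bursts straight to a reserved slice of QLC (placeholder)


def ingest(chunks: list) -> None:
    """Route each incoming write and let SCM and QLC absorb them in parallel."""
    with ThreadPoolExecutor() as pool:
        for chunk in chunks:
            target = (write_to_qlc_burst_area
                      if len(chunk) >= LARGE_WRITE_THRESHOLD
                      else write_to_scm_buffer)
            pool.submit(target, chunk)
```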
Synchronous Replication for Active-Active Clusters
For those applications where you can never lose data, only the gold standard of synchronous replication will do. Once a pair of VAST clusters is configured to synchronously replicate an S3 bucket, any data written to that bucket, at either cluster, will be replicated to the other cluster in the pair and acknowledged by the remote cluster before it is acknowledged to the client.
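The important detail is the ordering of acknowledgements. Here’s a minimal sketch of that ordering with hypothetical helper names (not the VAST implementation): the client is only told the write succeeded once the remote cluster has also acknowledged it.

```python
# Sketch of the synchronous-replication acknowledgement ordering. Helper names
# are hypothetical placeholders, not VAST APIs.
def write_local(bucket: str, key: str, data: bytes) -> bool:
    ...   # persist the object on the local cluster (placeholder)
    return True


def replicate_remote(bucket: str, key: str, data: bytes) -> bool:
    ...   # ship the object to the paired cluster and wait for its ack (placeholder)
    return True


def put_object(bucket: str, key: str, data: bytes) -> str:
    local_ok = write_local(bucket, key, data)
    remote_ok = replicate_remote(bucket, key, data)
    if local_ok and remote_ok:
        return "200 OK"   # the client is acknowledged only after both clusters have the data
    raise RuntimeError("replication incomplete; the client is never acknowledged")
```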
Cloud providers can use a synchronously replicating cluster pair to provide regional, as opposed to zonal, availability to their object storage offerings, while enterprise customers can engineer 100% uptime into their applications even when their data centers only provide 5 9s.
Database Replication
VAST 5.2 also expands VAST’s native asynchronous replication to cover VAST Tables. When a VAST cluster is configured to replicate folders containing VAST Tables, those tables will be replicated between the two clusters with full transactional consistency.
Traditional database systems running on traditional storage can take snapshots of the volumes or files that hold a database’s tables, but database software caches data and updates in memory rather than committing all changes to disk in order. This means the data in those files isn’t internally consistent, as only some table updates have been posted to storage, and others have only been posted to the in-memory portion of the database. We euphemistically call these database snapshots “Crash Consistent” because the data in those snapshots is only as consistent as it would have been if the database server crashed.
Getting a snapshot of a traditional database in a consistent state takes a little coordination between the database server and the storage system. A script, or the Windows VSS (Volume Shadow Copy Service), quiesces the database, flushing updates from memory to storage before the snapshot is taken.
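For comparison, here’s roughly what that external coordination looks like when it has to be scripted around a traditional database and storage array; the helper names are hypothetical, and the point is the quiesce, snapshot, resume ordering.

```python
# Sketch of an application-consistent snapshot on traditional storage:
# quiesce the database, take the storage snapshot, then resume. Helper names
# are hypothetical placeholders.
from contextlib import contextmanager


def quiesce_database() -> None:
    ...   # flush dirty pages to disk and pause new commits (placeholder)


def resume_database() -> None:
    ...   # let commits proceed again (placeholder)


def take_storage_snapshot(volume: str) -> str:
    ...   # ask the storage system for a point-in-time snapshot (placeholder)
    return f"snap-of-{volume}"


@contextmanager
def quiesced():
    quiesce_database()
    try:
        yield
    finally:
        resume_database()   # resume even if the snapshot fails


with quiesced():
    snapshot_id = take_storage_snapshot("db-volume-01")
```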
As a single unified data platform, VAST integrates the database management, snapshot, and replication processes into a coherent whole. When a VAST cluster takes a snapshot of a folder containing VAST Tables, the snapshot contains a consistent view of those tables. Any transactions that were complete at the snapshot time will be included in the snap, while updates from transactions that were in progress when the snapshot was taken will not be reflected or replicated to a remote site.
S3 Event Publishing
The VAST DataEngine will provide VAST users with all the tools they need to implement event-driven workflows that automatically run functions based on events like changes to the contents of a bucket or folder. In 5.2, we deliver the first step in that workflow automation with S3 event publishing.
When VAST customers configure event publishing on one or more of their folders, their VAST cluster will send an entry to a specified Kafka topic for each event. In version 5.2, that topic must be on an external Kafka cluster, and the functions must subscribe to the Kafka topic. Over the next few quarterly releases, the VAST DataEngine will add a Kafka API-compatible event broker and the functionality to process data.
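Here’s a sketch of the consuming side, assuming the kafka-python client, a topic named vast-s3-events, and JSON-encoded event records; the broker address, topic name, and payload fields are assumptions, not a documented format.

```python
# Minimal consumer sketch for S3 event records published to an external Kafka
# topic. The broker address, topic name, and payload fields are assumptions.
import json

from kafka import KafkaConsumer   # pip install kafka-python

consumer = KafkaConsumer(
    "vast-s3-events",                                 # assumed topic name
    bootstrap_servers=["kafka.example.com:9092"],     # external Kafka cluster
    group_id="bucket-watchers",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for record in consumer:
    event = record.value
    # Hand the event to whatever function should react to the bucket change.
    print(event.get("bucket"), event.get("key"), event.get("operation"))
```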
Global Namespace Cache Control
VAST 5.1 introduced the VAST global namespace, which lets VAST customers present Global Folders across multiple VAST clusters with full read-write access at the core, edge, and in the cloud—with strict consistency to ensure applications across the world are getting the latest data. For each Global Folder, the origin cluster holds a full copy of the folder’s data, and satellite clusters use their local SSD capacity to cache the data as it is accessed.
Caching data as it is accessed ensures that only the data that’s accessed by users or applications at that satellite is transferred. While that’s efficient, it does mean that applications have to wait for data to be transferred over the WAN link between the clusters.
VAST 5.2 adds control over how satellite clusters fetch data from the origin. A VAST admin can force a satellite to prefetch all metadata changes, or all data and metadata changes, to a folder. When a VAST customer has a workflow with one stage on-premises and a follow-on stage in the cloud, they can set the folder on their cloud cluster to prefetch data, eliminating the cache-warming delay.
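To see why prefetching matters for a two-stage pipeline, here’s a toy model (purely conceptual, not VAST code) of a satellite folder cache with and without prefetch; the class and data are invented for illustration.

```python
# Toy model of a satellite cluster's folder cache. By default, objects are
# pulled from the origin over the WAN on first read; with prefetch enabled,
# the satellite copies the folder ahead of time so reads hit local flash.
class SatelliteFolderCache:
    def __init__(self, origin: dict, prefetch: bool = False):
        self.origin = origin            # stand-in for the origin cluster's full copy
        self.local = {}                 # stand-in for the satellite's SSD cache
        if prefetch:
            self.local.update(origin)   # warm the cache before any reads arrive

    def read(self, name: str) -> bytes:
        if name not in self.local:                  # cache miss: WAN round trip
            self.local[name] = self.origin[name]
        return self.local[name]                     # cache hit: local SSD read


origin_data = {"frame-001": b"...", "frame-002": b"..."}
on_demand = SatelliteFolderCache(origin_data)                  # pays WAN latency on first read
prefetched = SatelliteFolderCache(origin_data, prefetch=True)  # first read is already local
```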
Gartner, Magic Quadrant for File and Object Storage Platforms, Chandra Mukhyala, Julia Palmer, Jeff Vogel, 8 October 2024.
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally, and MAGIC QUADRANT is a registered trademark of Gartner, Inc. and/or its affiliates; both are used herein with permission. All rights reserved.
Gartner does not endorse any vendor, product or service depicted in this content, nor does it make any warranties, expressed or implied, with respect to this content or its accuracy or completeness, including any warranties of merchantability or fitness for a particular purpose.