VAST Data’s Universal Storage is a single storage system that is fast enough for primary storage, scalable enough for huge datasets, and affordable enough to use for the full range of a customer’s data, thus eliminating the tyranny of tiers. Universal Storage redefines the economics of flash storage, making flash affordable for all applications, from the highest-performance databases to the largest data archives, for the first time. The Universal Storage concept blends game-changing storage innovations that lower the acquisition cost of flash with an exabyte-scale file and object storage architecture, breaking decades of storage tradeoffs.
With the advantage of new, enabling technologies that weren’t available before 2018, this new Universal Storage concept can achieve a previously-impossible architecture design point. The system combines low-cost QLC flash drives and 3D XPoint persistent memory (such as Intel Optane SSDs) with stateless, containerized storage services, all connected over new low-latency NVMe over Fabrics networks, to create VAST’s Disaggregated Shared Everything (DASE) scale-out architecture. Next-generation global algorithms are applied to this DASE architecture to deliver new levels of storage efficiency, resilience, and scale.
While the architecture concepts are sophisticated, the intent and vision of Universal Storage are simple: to bring an end to the data center HDD era and end the complexity of storage tiering that is a byproduct of the decades of compromises caused by mechanical media. This White Paper will introduce you to VAST Data’s Universal Storage and the DASE architecture and explain how this new architecture defies all conventional definitions of storage. In breaking the classic price/performance tradeoff, this system features all-flash performance at archive economics to simplify the data center and accelerate all modern applications.
Why Universal Storage?
The Tyranny of Tiers
Over 30 years ago, Gartner introduced the storage tiering model as a means to optimize data center costs by advising customers to deprecate older and less-valuable data to lower-cost (and slower) tiers of storage. Fast forward 30 years and the sprawl of storage technologies within organizations has grown to unmanageable proportions, with many of the world’s largest companies managing dozens of different types of storage. The problem shows up both across storage classes (for example: all-flash, hybrid, all-HDD, tape) and across protocol classes (block, file, object, big data, etc.), all of which creates a complex pyramid of storage technologies.
While the savings are clear when applying this model with legacy storage architectures, the idea that data should exist on a specific storage tier according to its current value creates multiple challenges:
The Demands of Artificial Intelligence Render Storage Tiering Obsolete
Arguably the greater problem with storage tiering is that this concept assumes that the applications accessing data enjoy a narrow and predefined view of their data access requirements. While that’s true for some applications, such as traditional database engines, new game-changing AI and analytics tools, such as machine learning and deep learning, see value in all data and want the fastest access to the largest amounts of data. For example, when a deep learning system trains its neural network model for facial recognition, the model becomes more accurate only once it has run against all the photos in the dataset, not just the 15-30% that may fit in some expensive flash tier. The value these applications deliver is proportionate to the corpus of data they are exposed to; they thrive on large datasets.
Defining Universal Storage
Universal Storage is a next-generation, scale-out file and object storage concept that breaks decades of storage tradeoffs, and in so doing defies classical storage definitions. Universal Storage is:
New Technologies Lay A New Storage Foundation
There are points in time where the introduction of new technologies make it possible to rethink fundamental approaches to system architecture. In order to realize the Universal Storage architecture vision, VAST made a bet on a trio of underlying technologies that were not available to previous storage architecture efforts, and in fact, only all became commercially viable in 2018. These are:
For the first time in 30 years, a new type of media has been introduced into the classic media hierarchy. 3D XPoint is a persistent memory technology that is both lower-latency and higher-endurance than the NAND flash used in SSDs, while retaining flash’s ability to hold data persistently without external power.
Universal Storage systems use 3D XPoint both as a high-performance write buffer, enabling the deployment of low-cost QLC flash for the system’s data store, and as a global metadata store. 3D XPoint was selected for its low write latency and long endurance. A Universal Storage cluster includes tens to hundreds of terabytes of 3D XPoint capacity, which provides the VAST DASE architecture with several architectural benefits:
NVMe over Fabrics
NVMe is the software interface that replaced the SCSI command set for accessing PCIe SSDs. Greater parallelism and lower command-queue overhead make NVMe SSDs significantly faster than their SAS or SATA equivalents.
NVMe over Fabrics (NVMe-oF) extends the NVMe API over commodity Ethernet and InfiniBand networks to provide PCIe levels of performance for remote storage access at data center scale. VAST’s DASE architecture disaggregates CPUs from storage and connects them to a globally accessible pool of 3D XPoint and QLC flash SSDs, enabling a system architecture that scales controllers independently of storage and providing the foundation for a new class of global storage algorithms intended to drive the effective cost of the system below the sum of the cost of goods. With NVMe-oF, VAST Containers enjoy the advantages of statelessness and shared-everything access to a global pool of 3D XPoint and flash, with direct-attached levels of storage access performance.
The logic of VAST’s Universal Storage cluster runs in stateless containers. Thanks to NVMe-oF and NVMe Flash and 3D XPoint, each container enjoys direct-attached levels of storage performance without having any direct-attached stateful storage. Containers make it simple to deploy and scale VAST as a software-defined microservice while also laying the foundation for a much more resilient architecture where container failures are non-disruptive to system operation.
Quad-Level Cell Flash (QLC) is the fourth and latest generation in flash memory density and therefore costs the least to manufacture. QLC stores 33% more data in the same space than Triple-Level Cell (TLC). Each cell in a QLC flash chip stores four bits, requiring 16 different voltage levels.
While QLC brings the cost per GB of flash down to unprecedentedly low levels, squeezing more bits in each cell comes with a cost. As each successive generation of flash chips reduced cost by fitting more bits in a cell, each generation also had lower endurance, wearing out after fewer write/erase cycles. The differences in endurance across flash generations are huge – while the first generation of NAND (SLC) could be overwritten 100,000 times, QLC endurance is 100x lower.
Erasing flash memory requires a high voltage that physically damages the flash cell’s insulating layer. After multiple cycles, enough damage accumulates to allow some electron leakage through the silicon’s insulating layer. Since the differences between charge levels in a QLC flash cell are 1/16th (or less) the size of those in an SLC cell, there is simply less margin for error; it is therefore easier for leaking electrons to change the state of a QLC cell from one value to another, hence QLC’s lower endurance.
Several flash vendors have now started talking about an even denser, 5-bit-per-cell flash generation. While designs are preliminary, this new PLC (Penta Level Cell) flash will store 25% more data per cell than QLC and is projected to have an endurance of only a few hundred write/erase cycles per cell, which the VAST architecture is also designed to accommodate.
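To put the generations side by side, here is a minimal Python sketch of the bits-per-cell and voltage-level relationship along with order-of-magnitude endurance figures; the SLC, QLC, and PLC cycle counts follow the text above, while the MLC and TLC figures are rough assumptions for illustration only.

```python
# Bits per cell vs. voltage levels, with order-of-magnitude endurance figures.
# SLC (~100,000 cycles), QLC (~100x lower) and PLC ("a few hundred") follow the
# text; the MLC and TLC cycle counts are assumptions for illustration only.
GENERATIONS = {
    "SLC": {"bits": 1, "pe_cycles": 100_000},
    "MLC": {"bits": 2, "pe_cycles": 10_000},   # assumed
    "TLC": {"bits": 3, "pe_cycles": 3_000},    # assumed
    "QLC": {"bits": 4, "pe_cycles": 1_000},
    "PLC": {"bits": 5, "pe_cycles": 300},      # projected
}

prev_bits = None
for name, gen in GENERATIONS.items():
    levels = 2 ** gen["bits"]                  # distinct charge levels per cell
    gain = f", {gen['bits'] / prev_bits - 1:.0%} denser than the last" if prev_bits else ""
    print(f"{name}: {gen['bits']} bits/cell = {levels} levels, "
          f"~{gen['pe_cycles']:,} P/E cycles{gain}")
    prev_bits = gen["bits"]
```

The density gain shrinks with each generation (QLC is 33% denser than TLC, PLC only 25% denser than QLC) while endurance falls off far faster, which is why flash management, not raw media cost, becomes the central design problem.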
VAST’s Universal Storage systems were designed to minimize flash wear by both using innovative new data structures that align with the internal geometry of low-cost QLC SSDs in ways never before attempted and a large 3D XPoint write buffer to absorb writes, providing the time, and space, to minimize flash wear. The combination allows VAST Data to warranty QLC or PLC flash systems for 10 years, which has its own impact on system ownership economics.
Scale-out Beyond Shared Nothing
For the past 10 years, the storage industry has convinced itself that a shared-nothing storage architecture is the best approach to achieving storage scale and cost savings. Following the release of the Google File System architecture whitepaper in 2003, it became table stakes for storage architectures of almost any variety to be built from a shared-nothing model, ranging from hyper-converged storage to scale-out file storage to object storage to data warehouse systems and beyond. 10 years later, the basic principles that shared-nothing systems were founded on are much less valid for the following reasons:
The DASE Architecture
VAST Universal Storage is based on a new scale-out architecture concept consisting of two building blocks that are scaled across a common NVMe Fabric. First, the state (and storage capacity) of the system is built from resilient, high-density NVMe-oF storage enclosures. Second, the logic of the system is implemented by stateless docker containers that each has the ability to connect to and manage all of the media in the enclosures. Since the compute elements are disaggregated from the media across a data center scale Fabric, each can scale independently – thereby decoupling capacity and performance.
In this Disaggregated Shared Everything (DASE) architecture, every VAST Server in the cluster has direct access to all the cluster’s storage media with PCIe levels of low latency.
VAST Servers provide the intelligence to transform enclosures full of 3D XPoint and QLC SSDs into an enterprise storage cluster. VAST Servers serve file and object protocol requests from NFS and S3 clients and manage the global namespace, called the VAST Element Store.
The VAST Server Operating System (VASTOS) provides multi-protocol access to the VAST Element Store by treating file and object protocols as interchangeable peers. Clients can write a file to an NFS mount and read the same data as an object from an S3 bucket (and vice-versa). VASTOS today supports NFS v3 (with optional extensions for RDMA (NFSoRDMA)) and the de facto cloud-standard S3 object storage protocol. Support for Server Message Block (SMB) is under development and will be made available for release in early 2020. Each server manages a collection of virtual IP addresses (VIPs) that clients mount via round-robin DNS to balance load across the cluster.
All the VAST Servers in a cluster mount all the storage devices in the cluster via NVMe-oF, providing global and direct access to all the data and metadata in the system. With this global view, VASTOS distributes data management services (erasure encoding, data reduction, etc) across the cluster’s CPUs so that cluster performance scales linearly as more CPUs are added.
VASTOS is deployed in stateless Docker containers to simplify software updates and cluster management across the VAST Server appliances. Containerization of VASTOS provides the benefit of abstracting the cluster’s logic from the underlying server hardware; as such, the DASE cluster is designed to support flexible deployment models, including:
Options 2 & 3, today, require VAST to qualify and approve a customer’s operating environment.
The Advantage of a Stateless Design
When a VAST Server receives an NFS or S3 read request, the Server accesses persistent metadata that is housed in 3D XPoint across the Fabric in order to locate a file or object’s data, and then reads data from QLC flash (or XPoint if the data has not yet been migrated from the buffer) before forwarding the data to the client. For write requests, the VAST Server writes both data and metadata directly to multiple XPoint SSDs before acknowledging writes. This direct access to shared devices over an ultra-low latency fabric eliminates the need for VAST servers to talk with each other in order to service an IO request – no machine talks to any other machine in the synchronous write or read path. Shared-Everything makes it easy to linearly scale performance just by adding CPUs and thereby overcome the law of diminishing returns that is often found when shared-nothing architectures are scaled up. Clusters can be built from 1,000s of VAST servers to provide extreme levels of aggregate performance… the only scalability limiter is the size of the Fabric that customers configure.
Storing all the system’s metadata in shared, persistent XPoint SSDs eliminates the need to maintain any coherency between Servers and eliminates the need for power failure protection hardware that would be otherwise required by volatile and expensive DRAM write-back caches. VAST’s DASE architecture pairs 100% nonvolatile media with transactional storage semantics to ensure that updates to the Element Store are always consistent and persistent.
VAST Servers do not, themselves, maintain any local state, thereby making it easy to scale services and fail around any Server outage. When a VAST Server joins a cluster, it executes a consistent hashing function to locate the root of various metadata trees. As Server resources are added, the cluster leader rebalances responsibility for shared functions. Should a Server go offline, other Servers easily adopt its VIPs and the clients will connect to the new servers within standard timeout ranges upon retry.
This Shared Everything cluster concept breaks the rigid association that storage systems have historically built around specific storage devices within a cluster. A VAST Cluster will continue to operate and provide all data services even with just one VAST Server running, as all state is stored in a set of globally-accessible and resilient storage enclosures. If, for example, a cluster consisted of 100 Servers, said cluster could lose as many as 99 machines and still be 100% online.
VAST Enclosures are resilient, NVMe-oF storage enclosures that connect XPoint and QLC flash SSDs to a high-throughput Ethernet or InfiniBand network. The enclosure is a highly-available NVMe over Fabrics JBOF (Just a Bunch of Flash) that makes the Optane and QLC flash SSDs it contains available to all the VAST Servers in a cluster, and it features no single point of failure: Fabric Modules, NICs, fans, and power supplies are all fully redundant, such that VAST Clusters can be built from as few as one Enclosure and scale to 1,000 Enclosures.
Per the figure above, each VAST Enclosure houses two Fabric Modules that are responsible for routing NVMe-oF requests from Ethernet or InfiniBand ports to the Enclosure’s SSDs through a complex of PCIe switch chips. With no single point of failure from network port to SSD, VAST Enclosures combine enterprise-grade resiliency with high-throughput connectivity. While at face-value the architecture of a VAST Enclosure appears similar to a dual-controller storage array, there are in reality several fundamental differences:
Fabric Failover With Single Ported SSDs
Server Pooling and Quality of Service
The VAST Servers in a cluster can be subdivided into Server Pools that create isolated failover domains. This makes it possible to provision the performance of an arbitrarily-sized pool of servers to a set of users or applications, isolating application traffic and ensuring a quality of ingress and egress performance that’s not possible in shared-nothing or shared-disk architectures.
Server pooling also provides the advantage of supporting multiple networks simultaneously. Users can build their backend Fabric on a common Ethernet or InfiniBand storage network while also provisioning additional Server front-end ports to talk across multiple, heterogeneous InfiniBand or Ethernet subnets. Pooling makes it easy to add file services to hosts on different networks that all access a global namespace.
The statelessness of the VAST Server architecture makes it easy to dynamically provision server pools (even programmatically by API) to adapt to the needs of an evolving application stack.
Legacy architectures frequently require clusters to be built from homogeneous pools of servers and storage. This kind of infrastructure pooling often creates rigid boundaries within which data can be striped and often forces forklift upgrades of whole clusters to add capacity or performance. VASTOS, on the other hand, is designed to fully support asymmetrical expansion, making it simple to add heterogeneous resources to the shared pool of CPUs and SSDs that compose a VAST cluster.
The path to asymmetry is paved with a few architectural advantages that VAST enjoys, chief among them that flash storage breaks the long-standing tradeoff between performance and capacity exhibited by HDDs. Unlike HDDs, whose performance has long since stopped improving from one generation to the next, flash performance (in terms of IOPS and bandwidth) has evolved proportionately with capacity. Applying this concept at the system level, as VAST storage grows denser it also grows proportionately in performance, such that an asymmetrically scaled flash architecture can be thought of as intrinsically well-balanced over time.
To accomplish the objective of asymmetry and erase the boundaries of infrastructure utilization in an evolving cluster, VASTOS virtualizes underlying hardware as much as possible in order to enable intelligent scheduling of system services and data placement at two levels:
DASE Architecture Benefits Summary
VAST’s DASE architecture has several distinct advantages when compared to more traditional shared-nothing and dual-controller architectures. These include:
How It Works
Any modern storage system is defined by its software, and VAST Universal Storage is no exception to this rule. VAST’s Universal Storage offering is a SW-defined, API-driven architecture implemented on commodity infrastructure.
The Element Store is a combination of capabilities and inventions that make the Universal Storage concept possible, including:
The Element Store: the VAST Universal Storage Namespace
The Element Store is the heart of VAST Data’s Universal Storage. To create a global storage namespace, The Element Store abstracts the hierarchical structure of a file system and the flat structure of object buckets onto a unified set of file/object metadata. The Element Store treats file and object protocols as co-equal conduits to access a common set of data elements.
The VAST Element Store manages all the SSDs in a VAST Cluster to form a single pool of storage and manages that capacity as a single namespace accessible as a file system and/or object store.
All the Element Store’s metadata, from basic file names to multiprotocol ACLs and locks, is maintained on shared media in VAST enclosures. This allows the mirrored and distributed metadata to serve as a consistent single source of truth regarding the state of the Element Store.
Eliminating the need for server-side caching also eliminates the overhead, and again the complexity, of keeping cached data coherent across multiple storage controllers. VAST systems store all their system state in the shared enclosures that are globally accessible to each server over NVMe-oF. Since each VAST Server accesses a single source of system state directly, the servers don’t need to create the east-west traffic that shared-nothing systems require to keep each other’s caches up to date. This has two significant advantages:
The VAST Element Store maintains its persistent metadata in V-Trees. VAST’s V-Trees are a variation of a B-tree data structure, specifically designed to be stored in shared, persistent memory. Because VAST Servers are stateless, a new metadata structure was needed to enable them to quickly traverse the system’s metadata stored on remote XPoint devices. To achieve this, VAST designed a tree structure with extremely wide fan-out: each node in a V-Tree can have hundreds of child elements, limiting the depth of an element search, and the number of round trips over the network, to no more than seven hops.
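To see why such a wide fan-out keeps lookups shallow, here is a minimal sketch; the fan-out of 500 children per node is an assumed figure for illustration only.

```python
import math

# With hundreds of children per V-Tree node, even enormous element counts
# resolve in a handful of levels; FAN_OUT = 500 is an assumed figure.
FAN_OUT = 500

def worst_case_depth(element_count: int, fan_out: int = FAN_OUT) -> int:
    """Levels needed for a tree of this fan-out to index element_count leaves."""
    return max(1, math.ceil(math.log(element_count, fan_out)))

for count in (10**6, 10**9, 10**12, 10**15):
    print(f"{count:>19,} elements -> depth {worst_case_depth(count)}")
# 500**7 is roughly 7.8e18, so seven levels comfortably cover an exabyte-scale
# namespace, matching the "no more than seven hops" bound described above.
```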
While it isn’t organized in tables, rows, and columns, the Element Store’s V-Tree architecture enables the metadata store to act in many ways like a database – allowing VAST Servers to perform queries in parallel and locate, for example, an object in an S3 bucket, or the same data as a file, as it existed when the 12:01 AM January 1, 2024 snapshot was taken.
Just as adding CPUs adds a linearly scalable unit of performance, Element Store metadata is distributed across all the cluster’s 3D XPoint, enabling both the namespace and its performance to scale as the VAST Cluster grows.
The VAST Element Store was designed to marry the transactional guarantees of an ACID database with the performance of a parallel file system and the scale of an object store. To achieve this goal, VAST Data needed to create a new distributed transaction model, hybrid metadata structures that combine consistent hashing with tree-oriented metadata, and new locking and transaction management mechanisms.
At its core, the Element Store manages its metadata across a shared pool of 3D XPoint using consistent hashing of each element’s (file, object, folder, etc.) handle (unique identifier). The hash space is divided into ranges, with each range assigned to two of the cluster’s enclosures. Those two enclosures then hold the metadata roots for elements whose handles hash to values in the ranges they are responsible for.
VAST Servers load the 1GB consistent hash table into memory as they re-join the cluster at boot. When a VAST Server wants to find data within a particular file or object, it calculates the hash of the element’s handle, and performs a high-speed memory lookup to find which enclosure(s) hold that element’s metadata by hash range. The VAST Server can then read the element’s V-Tree from the enclosure that is responsible for that portion of the hash space.
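A minimal sketch of that lookup flow is shown below; the number of hash ranges, the hash function, and the round-robin replica assignment are illustrative assumptions, not VAST’s actual implementation.

```python
import hashlib

# The hash space is split into ranges, each owned by a pair of enclosures; a
# server resolves an element's metadata root with a pure in-memory lookup.
NUM_RANGES = 1024
ENCLOSURES = [f"enclosure-{i}" for i in range(4)]

# Assumed round-robin assignment of (primary, mirror) enclosures per range.
RANGE_OWNERS = [
    (ENCLOSURES[i % len(ENCLOSURES)], ENCLOSURES[(i + 1) % len(ENCLOSURES)])
    for i in range(NUM_RANGES)
]

def handle_hash(handle: str) -> int:
    """Hash an element handle (unique identifier) to a point in the hash space."""
    return int.from_bytes(hashlib.sha256(handle.encode()).digest()[:8], "big")

def metadata_root_owners(handle: str):
    """Return the two enclosures holding this element's metadata root."""
    return RANGE_OWNERS[handle_hash(handle) % NUM_RANGES]

print(metadata_root_owners("/datasets/images/cat_0001.jpg"))
print(metadata_root_owners("s3://training-bucket/checkpoints/model.ckpt"))
```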
By limiting the use of consistent hashing to the root of each data element, the dataset size per hash table remains very small and the Element Store eliminates the risk of hash collisions as a result. Because hash tables are only required for the root of each data element, the system scales while minimizing the amount of hash data that must be recalculated when VAST clusters expand. When a new enclosure is added to a cluster, it assumes ownership of its share of the consistent hash space and only the metadata in those hash ranges is migrated to the new enclosure.
Unlike eventually-consistent systems (such as object storage), the VAST Element Store provides a consistent namespace through all the VAST Servers in a cluster to all the users. Changes made by a user on one node are immediately reflected and available to all other users.
To remain consistent, the VAST Element Store ensures that each transaction is atomic. Each storage transaction is either applied to the metadata (and all of its mirrors) in its entirety, or not applied at all, even if a single transaction updates many metadata objects. With atomic write consistency, classic file system check tools (such as the dreaded fsck) are no longer needed, and systems are instantly functional after power-cycle events.
For operations that fail before step #4 is successfully completed, the system will force clients to retry a write and any old/stale/disconnected metadata that is remaining from the previous attempt will be handled via a background scrubbing process.
If, by comparison, the system updated its metadata from the top down, first adding a new extent to the file descriptor followed by the extent and block metadata, a failure in the middle of the transaction would leave pointers to nothing and corrupt data. Such a system would have to treat the entire change as one complex transaction, with the file object locked the whole time. Updating from the bottom up, the file only has to be locked when the new metadata is linked in (3 writes vs. 20); shorter locks reduce contention and therefore improve performance.
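The following sketch illustrates that bottom-up ordering under stated assumptions; the data structures and lock usage are hypothetical and only meant to show why a crash before the final link-in leaves orphaned metadata rather than corruption.

```python
import threading

# Bottom-up update order: new block and extent metadata are written first in
# free space, and the element is locked only for the single pointer update that
# links the new subtree in.  A crash before that link-in leaves unreferenced
# metadata for the background scrubber to clean, never a dangling pointer.
class Element:
    def __init__(self):
        self.lock = threading.Lock()
        self.extents = []                 # committed, reader-visible extents

def write_bottom_up(element: Element, new_block_pointers: list) -> None:
    # 1. Build the new extent metadata in free space; readers cannot see it yet,
    #    so no lock is required for this (potentially long) step.
    new_extent = {"blocks": list(new_block_pointers)}

    # 2. Lock only for the final, atomic link-in of the already-written metadata.
    with element.lock:
        element.extents.append(new_extent)

elem = Element()
write_bottom_up(elem, ["xpoint-ssd-2:0x4000", "xpoint-ssd-5:0x4000"])
print(elem.extents)
```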
As the VAST Server writes changes to the V-Tree, it creates new metadata objects with the transaction token attached. When another VAST Server accesses an element’s metadata, it checks the transaction token’s state and then takes the appropriate action, using the latest data for committed transactions. If a VAST Server (the requesting Server) wants to update a piece of metadata that is already part of an in-flight transaction, it will poll the owning Server to ensure it’s still operational and will stand down until the owning Server completes the in-flight transaction. If, however, a requesting Server finds in-flight data owned by a non-responsive owning Server, the requesting Server will access the consistent state of the namespace using the previous metadata and can also cancel the transaction, removing the metadata updates with that token.
Metadata locks, like Transaction Tokens, are signed with the ID of the VAST Server that reserved the lock. When a VAST Server discovers a metadata object is locked, it contacts the server identified by the lock directly, preventing zombie locks without the bottleneck of a central lock manager. If the owning Server is unresponsive, the requesting Server will also ask another, uninvolved Server to poll the owning Server, ensuring the requesting Server does not experience a false positive and prematurely fail an owning Server out of the VAST Cluster.
To ensure that write operations are fast, the VAST Cluster holds read-only copies of Element Locks in the DRAM of the VAST Enclosure where the relevant Element lives. In order to quickly verify an Element’s lock state, a VAST Server performs an atomic RDMA operation to the memory of an Enclosure’s Fabric Module to verify and update locks.
While the above Locking semantics apply at the storage layer, the VAST Cluster also provides facilities for byte-granular file system locking that is discussed later in this document (see: Universal Storage: Protocols).
A Thin-Provisioned, Byte-Granular Data Store
While much of the focus of the above has been about system metadata, and how to ensure consistency of operations in a scale-out cluster – it’s important to point out the innovations that can be found in the data store of the VAST Cluster:
Inherently Scalable Namespace
Users have long struggled with the scaling limitations of conventional file systems for a variety of reasons: from metadata slowdowns when accessing folders that contain more than a few thousand files, to preallocated inodes that limit the number of files that can be stored in a file system as a whole. These limitations were one of the major drivers behind the rise of object storage systems in the early 2000s. Freed from the limitations of storing metadata in inodes and nested directory files, object stores scale to billions of objects over petabytes of data without being bogged down.
The VAST Element Store provides the essentially unlimited scalability promised by object stores for applications using both object and file access methods and APIs. The Element Store’s V-Trees expand as enclosures are added to the VAST cluster, with consistent hashes managing the distribution of V-Trees across enclosures (all without explicit limits on object sizes, counts or distribution).
A Breakthrough Approach to Commodity Flash Management
In order to build a system that provides game-changing Flash economics, VAST Data engineered a system architecture to accommodate the absolute lowest-cost, commodity Flash.
While commodity Flash has an economic appeal that makes it the media du jour in the hyperscale market, the hyperscale companies that have embraced this style of low-cost Flash also write their own applications to overcome the performance and write endurance limitations of these devices. Since most organizations don’t have the luxury of rewriting their applications to benefit from commodity Flash, the VAST Data team endeavored to build a system that serves as an abstraction between any application and commodity Flash, media that certainly was not designed with legacy applications in mind.
In order to enable customers to adopt this new class of Flash technology – VAST Data designed a new Flash translation layer and data layout to accommodate the unique geometry of QLC, and future PLC, Flash and a new approach to data protection that minimizes namespace fragmentation.
Challenges With Commodity (QLC) Flash
The low cost of commodity, hyperscale-grade SSDs – using QLC flash today and PLC in the future – is an important part of how a VAST Universal Storage cluster can achieve archive economics. There are a few barriers that present challenges for legacy systems to use this new type of Flash technology:
QLC Flash Page Size and Erase Block Challenges
SSD endurance is the total amount of data that an SSD’s manufacturer warrants a user can write to the drive before the drive fails. Until recently, SSD vendors reported the endurance of their products in DWPD (Drive Writes Per Day for 5 years) or TBW (Terabytes Written) based on a JEDEC standard workload that is based on 77% of the writes being 4 KB or less. This makes sense because many legacy storage products were designed to write in 4KB disk blocks and 4KB is how people tend to think about IOPS workloads – which is what people have been buying flash for in the first place. However, as QLC Flash matures in the market and SSD vendors wrestle with how to position low-endurance media for enterprise workloads, they’ve now started to reveal more information about how their SSDs react to different write patterns.
As seen in the chart below, the workload/endurance curve for Intel’s P4326 15.36TB QLC SSD shows multiple endurance ratings depending on the size and sequentiality of writes to the drive:
QLC SSDs provide 20 times more endurance when they’re written to in large sequential stripes than when they’re written to with 4KB random writes, as is common in many enterprise storage systems.
Why Write Geometry Matters for QLC SSDs
To understand why small writes consume so much more of an SSD’s endurance than large writes, we have to dive into the internal structure of the flash chips themselves. While Flash is often referred to as flash memory, the NAND flash in a data center SSD is very different from DRAM. Where DRAM and 3D XPoint are byte-addressable media, Flash, like HDDs, can only be addressed in discrete blocks.
The flash on each chip in an SSD is organized into a hierarchy of pages, the smallest unit of data that can be read from or written to flash. Flash pages are then assembled into much bigger erase blocks. The SSD controller can write to any blank flash page once, but the page must be erased before it can be written to again. To erase a flash page and free up capacity for a subsequent write, a flash chip applies a high voltage to all the cells in an erase block, erasing every cell in the block. In many ways, a flash erase block is like an Etch-a-Sketch. You can draw on each area of the screen once before you have to turn it over, shake and erase the whole screen… you can’t just erase the flash pages that you want to recover capacity from.
The pages in today’s QLC flash chips are sized anywhere between 64KB and 128KB, and the erase blocks are up to 200MB. This unit of capacity management is monumentally larger than what classic storage systems were ever designed to deal with. When a system is not designed to deal with these two levels of capacity management, problems creep in.
When the SSD receives large, sequential writes, it can fill whole flash pages. When the SSD receives a 4KB write, it is forced to store just that 4KB in a 64KB flash page. Since flash pages are write-once, each 4KB write results in 60KB of wasted flash space and, more importantly, wasted endurance, because the SSD must later compact that data in order to reclaim the unused capacity. Each time the SSD compacts data and moves it from one set of pages to another, it consumes one of the limited write/erase cycles the flash can endure. This behavior, where a storage subsystem must move data more often than the user writes it in order to free up fragmented flash capacity, is commonly referred to as write amplification… and write amplification is the enemy of low-endurance flash media.
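The arithmetic behind that waste is simple enough to sketch; the page and write sizes follow the text above.

```python
# Waste and write amplification for a 4KB write landing in a 64KB flash page.
PAGE_SIZE_KB = 64
WRITE_SIZE_KB = 4

wasted_kb = PAGE_SIZE_KB - WRITE_SIZE_KB
print(f"Space wasted per 4KB write: {wasted_kb} KB "
      f"({wasted_kb / PAGE_SIZE_KB:.0%} of the page)")

# At page granularity the SSD programs a whole 64KB page for 4KB of host data:
print(f"Flash programmed per host byte: {PAGE_SIZE_KB / WRITE_SIZE_KB:.0f}x")
# Every later compaction pass that relocates the still-valid 4KB consumes
# additional program/erase cycles on top of that initial amplification.
```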
As seen, the geometry, or size, of a Flash write is a critical aspect of determining SSD longevity, but it is not the only issue. As explained below, data protection (e.g., RAID striping) is also an area where storage systems can create data fragmentation, causing write amplification.
Universal Storage Data Structures: Optimized for Commodity Flash
In order to ensure that data is appropriately placed in Flash according to a drive’s optimized geometry, it is critical that an application is never able to write into a region of a Flash drive in place. Since the applications don’t know about the limitations of the Flash and the Flash doesn’t know what the applications are trying to accomplish, an abstraction needs to be created between these two in order to ensure that the SSDs realize the best possible longevity.
VAST’s design goal has been to use commodity media and avoid imposing upon customers the tax associated with proprietary flash drives or blades. In order to marry this objective with QLC low-endurance COTS flash drives, the architecture has been designed to essentially ‘spoof’ the commodity drives into not knowing what’s happening at the logical namespace layer – they therefore never need to manage data at the block level. When it comes to data placement, The VAST Element Store takes many of the concepts that have been popularized by log-structured file systems and extends them to maximize the efficiency of VAST’s Similarity-Based Data Reduction, enable the use of very wide write stripes and minimize write amplification to commodity SSDs. Because write cycles are precious, it’s critical that a system control writes globally with absolute precision… and controlling data placement starts with indirection.
Conventional storage systems maintain static relationships between logical locations, like the 1 millionth-4 millionth bytes of Little_Shop_of_Horrors.MP4, and the physical storage locations that hold that data. As such, when an application writes to that location, the system physically overwrites the old data with the new in place. This, of course, becomes problematic for commodity SSDs when these overwrites aren’t the same size as a flash erase block – as anything less than a perfect large (200MB) write will create write amplification.
The VAST Element Store performs all writes by way of indirection. Rather than overwriting data in place, the Element Store writes data to free space on XPoint SSDs and builds metadata pointers associating the logical location that the new data was written to with its dynamic physical location. When the system later reduces, erasure-codes, and migrates that data to free space on flash SSDs, it updates the metadata pointers to indicate the new location of data.
If we oversimplify a little, an element (file, object, folder, symlink, etc.) in the VAST Element Store is defined as a series of pointers to the locations on 3D XPoint or flash… where the data is stored at any given time.
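A toy sketch of that indirection might look like the following; the mapping structure, device names, and offsets are purely illustrative.

```python
# Logical byte ranges of an element point at whatever physical location holds
# the data right now; overwrites and migrations only move the pointers.
element_map = {
    # (logical_offset, length_bytes) -> (device, device_offset)
    (0, 1_048_576): ("xpoint-ssd-7", 0x3A00_0000),   # freshly ingested, in the buffer
}

def overwrite(offset: int, length: int, new_location: tuple) -> None:
    """An overwrite lands in free space; the old blocks are simply unreferenced."""
    element_map[(offset, length)] = new_location

def migrate_to_flash(offset: int, length: int, flash_location: tuple) -> None:
    """Later migration from the XPoint buffer to QLC flash is also just a re-point."""
    element_map[(offset, length)] = flash_location

migrate_to_flash(0, 1_048_576, ("qlc-ssd-42", 0x1F40_0000))
print(element_map)
```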
With Large Data Stripes, Drives Never Need To Garbage Collect
Legacy indirection-based file systems use a single logical block size through all their data services. Using 4KB or 32KB blocks for data reduction, data protection and data storage – this fixed unit of data size keeps things simple for traditional storage systems, especially when sizing file system metadata stores because the systems can always point to a consistently-sized data block.
The problem with this approach is that it forces the storage architect to compromise between the low-latency and fine-granularity of deduplication of small blocks vs. the greater compressibility and lower flash wear of larger blocks. Rather than make that compromise, the VAST Element Store manages data at each stage of its journey:
The Element Store writes to SSDs in 1MB I/Os, a significant multiple of the underlying flash’s 64KB-128KB page size, thereby allowing the SSD controller to parallelize the I/O across the flash chips within an SSD and also completely fill flash pages, preventing the write amplification that is caused by partially written pages.
The Element Store manages data in deep data strips that layer 1MB I/Os into a larger 1GB strip of data on each SSD. Just as writing 1MB I/Os to the SSDs aligns the write to a number of full flash pages, managing data in 1GB strips aligns the deletion of data from the SSDs with the SSD’s internal 200MB erase block boundary.
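The alignment arithmetic, using the sizes cited above, can be sketched as follows.

```python
# Alignment arithmetic for the sizes cited above.
MB = 1024 * 1024

io_size = 1 * MB
for page_kb in (64, 128):
    full_pages = io_size // (page_kb * 1024)
    print(f"A 1MB write fills {full_pages} complete {page_kb}KB pages (no partial pages)")

strip_size = 1024 * MB      # 1GB strip of data per SSD
erase_block = 200 * MB      # erase blocks of up to ~200MB
print(f"Deleting a 1GB strip frees roughly {strip_size / erase_block:.1f} erase blocks")
```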
Remember, the VAST Cluster always writes into free space, so a block write or overwrite never happens within a pre-existing strip of data. As with many other operations in the VAST Cluster, garbage collection is performed globally across the shared-everything cluster. When enough data from a stripe has been deleted that the Element Store needs to perform garbage collection, it deletes a full 1GB of data that was previously written to the SSD sequentially. If all of the pages in a block have not been invalidated, the remaining data is written to a new data stripe, but this is an uncommon occurrence thanks to Universal Storage’s Foresight (also known as Predictive Placement) feature, discussed later. In essence, this approach allows the system to cleanly write to, and subsequently delete data from, a large number of flash erase blocks while preventing the SSD from wearing itself out by doing its own unoptimized garbage collection.
The 1GB strip depth and 1MB sub-strip depth are not fundamentally fixed values in the VAST Universal Storage architecture but were determined empirically by measuring the wear caused by writing and erasing multiple patterns on today’s generation of QLC SSDs. As new SSDs are qualified and as PLC flash enters the market, the VAST Cluster write layout can adjust to even the larger writes and erases future generations of flash will require to minimize wear.
The underlying architecture components make all of this intelligent placement possible:
Endurance is Amortized: Wear-Leveling Across Many-Petabyte Storage
Universal Storage also extends flash endurance by treating the petabytes of flash in a cluster as a single pool of erase blocks that can be globally managed by any VAST Server. The VAST Cluster performs wear-leveling across that pool to allow a Universal Storage system to amortize even very-high-churn applications across petabytes of flash in the Cluster. Because of its scale-out design, the VAST Cluster only needs to work toward the weighted overwrite average of the applications in an environment, where (for example) a database that continually writes 4KB updates will only overwrite a fraction of even the smallest VAST cluster in a day.
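As a rough illustration of that amortization, consider the following back-of-the-envelope sketch; the IOPS rate and cluster capacity are assumed figures chosen only to show the proportion.

```python
# Back-of-the-envelope amortization: a busy database issuing 4KB overwrites
# against a (hypothetical) 1PB flash pool.  Both figures are assumptions.
TB = 10**12

iops = 100_000
bytes_per_day = iops * 4 * 1024 * 86_400       # ~35 TB of overwrites per day
cluster_flash = 1_000 * TB                     # 1PB of QLC in the cluster

print(f"Daily overwrite volume: {bytes_per_day / TB:.1f} TB "
      f"= {bytes_per_day / cluster_flash:.1%} of the cluster's flash per day")
# Spread across the global pool of erase blocks, even this churn consumes only
# a tiny fraction of any individual SSD's endurance budget.
```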
The large scale of VAST clusters is also key to the system’s being able to perform garbage collection using the 150GB stripes that minimize flash wear and data protection overhead. Unlike some other write-in-free-space storage systems, the VAST Element store continues to provide full performance up to 90% full.
VAST Foresight: Retention-Aware Data Protection Stripes
While VAST’s data layout writes only in large 1GB strips in order to ensure that the SSDs don’t wake up and perform their own unoptimized garbage collection, this does not solve the problem created by having multiple different streams write data of different varieties into a common erase block or erasure code stripe. It could be argued that, with 150+4 striping, VAST Clusters create more opportunity for write amplification by writing erasure code stripes that are much larger than those of legacy storage systems. The reason is simple: larger write stripes are filled with data from many different ingest streams, and these streams don’t write data with a common lifetime or retention period, so when short-term data is deleted, the still-valid data in a write stripe must be re-written to a new write stripe, resulting in write amplification.
To address this problem, VAST created Foresight, its method of intelligently placing data into erase blocks according to the expected lifespan of a write. Foresight is based on the simple truth that write amplification is minimized when write stripes consisting of invalidated, or mostly invalidated, flash pages are deleted at the same time.
Until Foresight, journaled and log-structured storage systems basically wrote data to their SSDs in the order the data was written to the system. With SSDs that have large (many-hundred-MB) erase blocks, data from multiple application streams becomes intermixed. For example, a single erasure stripe may contain items with a long life expectancy, like the contents of a media database, alongside items with a very short life expectancy, like the indices, transaction logs, and scratch files created by that same database engine. Even if the application writes this data to different locations, the internal data aggregation performed by storage arrays and flash SSDs will ultimately place the data, as it arrives, on adjacent pages within common erase blocks that form an erasure stripe.
When a system subsequently performs garbage collection, the short-lived data will have mostly expired, but these data blocks will have been intermixed with the more permanent data written at the same time – leaving a lot of garbage to be collected, and the still-valid pages will create write amplification as the system reclaims SSD capacity. This process repeats over time as the system continues to perform garbage collection of data that has variable retention lifespans – and can result in the same data being moved 10 or 20 times over the lifespan of an SSD, consuming more of the SSDs endurance each time.
Foresight uses the data’s attributes as well as application overwrite history to predict the life expectancy of each new piece of data as it’s ingested. In the VAST Cluster’s XPoint buffer, the system manages a series of buffers that it can fill and flush asynchronously; in this way, Universal Storage breaks from the classical definition of a log-structured storage system. Each buffer is filled with data of a common retention period, as determined by VAST’s Foresight algorithms, so each erasure-coded data stripe can be laid down to storage without causing much data movement when the data in the stripe ultimately expires, since it should all expire around the same time.
Ephemeral data, such as temporary files and transaction logs, is written to a “hot” stripe of short-lived data, while other stripes are formed concurrently from “cold” or inactive/archive data that is expected to be retained indefinitely. The VAST Cluster’s XPoint write buffer is large enough to let the system build many retention-based write stripes at once, even very wide ones, while flushing each stripe to commodity flash only when a full stripe has been formed.
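A minimal sketch of this retention-aware buffering is shown below; the lifetime classifier, class names, and stripe size are illustrative assumptions rather than VAST’s actual Foresight logic.

```python
from collections import defaultdict

# Writes are routed into a buffer for their predicted lifetime class; a stripe
# is flushed to QLC only once a full stripe of like-lived data has accumulated.
STRIPE_WIDTH = 150                    # data strips per stripe (before parity)
buffers = defaultdict(list)

def predict_lifetime(path: str) -> str:
    """Toy classifier; real Foresight uses data attributes and overwrite history."""
    return "hot" if path.endswith((".tmp", ".log")) else "cold"

def flush_stripe(lifetime_class: str, strips: list) -> None:
    print(f"flushing a full '{lifetime_class}' stripe of {len(strips)} strips to QLC")

def ingest(path: str, strip) -> None:
    lifetime_class = predict_lifetime(path)
    buffers[lifetime_class].append(strip)
    if len(buffers[lifetime_class]) == STRIPE_WIDTH:
        flush_stripe(lifetime_class, buffers.pop(lifetime_class))

for i in range(STRIPE_WIDTH):
    ingest(f"/scratch/job_{i}.tmp", f"strip-{i}")   # fills and flushes one "hot" stripe
```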
The life expectancy prediction is stored in the V-Tree as a metadata value for data, based on:
While the system cannot be omniscient about the exact life expectancy of every write, a VAST Cluster also optimizes data placement whenever it has to move a block of data during garbage collection. If a piece of data outlived the rest of the data in its original write stripe, it is likely to live even longer. When the Element Store performs garbage collection, it chooses the stripes with the least remaining data, which usually turn out to be the stripes that were originally filled with ephemeral (short-lived) data. As discussed in detail in the Garbage Collection section below, the garbage collection process copies whatever data was incorrectly tagged as ephemeral to a new stripe with a higher life expectancy.
QLC and 10-Year Endurance
One of the primary design tenets of the VAST Element Store is to minimize the write amplification caused by normal namespace churn. The Element Store writes and manages data on SSDs in alignment with the internal page and erase block structure of the flash, wear levels across SSDs in the cluster and acts as a global flash translation layer extending flash management from being strictly an SSD function to managing a global pool of flash.
The result of this combination of inventions is a Cluster architecture that requires only one-tenth of the endurance that today’s generation of commodity SSDs can deliver. This exceptionally high level of system-level endurance paves the way for VAST’s future use of even denser flash media, such as PLC flash, and it also enables customers to rethink the longevity of data center infrastructure. To facilitate a longer investment lifespan, VAST Data provides a system warranty that can be extended for as many as 10 years, where the drive and its endurance are both covered as part of an active maintenance agreement.
A Breakthrough Approach to Data Protection
Breaking the Resilience vs. Cost Tradeoff
Protecting user data is the primary purpose of any enterprise storage system, but conventional data protection solutions like replication, RAID, and Reed-Solomon erasure coding force difficult trade-offs between performance, resiliency, storage overhead, and complexity. As referenced in our previous blog on this topic2, the VAST Element Store combines 3D XPoint and NVMe-oF with innovative erasure codes to deliver high resiliency and performance with unprecedentedly low overhead and complexity.
One of the primary initial design tenets of the VAST Cluster architecture was to bring the system overhead down from 66% (an acute case… triplication overhead) to as little as 2% while also increasing the resilience of a cluster beyond what classic triplication and/or erasure codes today provide. The result is a new class of error correction codes that deliver higher resiliency (Millions of hours Mean Time To Data Loss) and lower overhead (typically under 3%) than previously possible.
VAST Locally-Decodable Error Correction Codes
As discussed, the VAST Cluster architecture uses highly-available storage enclosures to hold all the system’s flash and 3D XPoint SSDs. Because the VAST enclosure is fully redundant, with no single point of failure, it is possible to implement a stripe with lower overhead than a shared-nothing architecture in which a node is a unit of failure. For example:
Wide write stripes are the means to drive storage efficiency overhead to an all-time low, but wide stripes also increase the probability of multiple device failures within a stripe. While flash SSDs are very reliable devices, especially compared to HDDs, it is simply more statistically likely that two of the 152 SSDs in a 150+2 stripe will fail than two of the 12 SSDs in a 10+2 stripe.
To solve the problem of increased failure probability, VAST Clusters:
While adding additional parity strips to a write stripe helps increase the stripe’s resilience, this is not enough to counterbalance the higher probability of failure that results from large write striping. The other aspect of ensuring high-resilience in a wide write striping cluster is to minimize the time to recovery.
Storage systems have historically kept erasure-coded stripe sizes narrow because the RAID and erasure codes they use, often based on the Reed Solomon error correction algorithm, require systems to read all the surviving data strips in a protection stripe, and one or more parity strips, to regenerate data from a failed part of the protection stripe. To understand this problem in practice, consider the VAST Cluster’s 150+4 stripe width – in order to rebuild 1GB of data on an encoded stripe of 150+4, a rebuild event would require reading 150 GB of data from the surviving SSDs if such a system was built to provide a Reed-Solomon style of recovery.
To get around this problem, VAST Data designed a new class of erasure code borrowing from a new algorithmic concept called Locally Decodable Codes. The advantage of Locally Decodable Codes is that they can reconstruct, or decode, data from a fraction of the surviving data strips within a protection stripe. That fraction is proportionate to the number of parity strips in a stripe. That means a VAST cluster reconstructing a 150+4 data stripe only has to read 38 x 1GB data strips, just 1/4th of the survivors.
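The rebuild-read arithmetic can be sketched as follows, using the stripe geometry cited above.

```python
# Rebuild reads per 1GB of lost data, using the stripe geometry cited above.
DATA_STRIPS = 150
PARITY_STRIPS = 4
STRIP_GB = 1

reed_solomon_read_gb = DATA_STRIPS * STRIP_GB                             # read every survivor
locally_decodable_read_gb = -(-DATA_STRIPS // PARITY_STRIPS) * STRIP_GB   # ceil(150 / 4) = 38

print(f"Reed-Solomon style rebuild reads: {reed_solomon_read_gb} GB")
print(f"Locally decodable rebuild reads:  {locally_decodable_read_gb} GB "
      f"(about 1/{PARITY_STRIPS} of the survivors)")
```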
How Locally-Decodable Erasure Codes Work
3D XPoint Eliminates the Need for Nerd Knobs
Writes to 3D XPoint are mirrored. The low latency of 3D XPoint combines with the fact that, unlike flash, 3D XPoint devices don’t have the LBA/page/erase-block hierarchy, or the complications that hierarchy creates, so their prodigious endurance isn’t dependent on write patterns the way flash SSDs’ endurance is. This means the 3D XPoint write buffer provides consistently high write performance regardless of the I/O size mix. Any data written by an application is written immediately to multiple Optane SSDs, and once the data is safely on 3D XPoint the write is acknowledged to the application. Data is migrated to QLC later, long after the write has been acknowledged.
On the flip side, the random access speed of flash combines with VAST’s wide write stripes and parity declustering to ensure that reads are very fast, because all writes are parallelized across many flash devices. That said, Locally Decodable erasure codes do not require a full-stripe read on every operation; to the contrary, reads can be as small as one SSD block or as large as a multitude of write stripes. In this way, there is only a loose correlation between the width of a stripe and the width of a file, object, directory, or bucket read.
Extensible Erasure Codes
Intelligent, Data Only Rebuilds
Because of the shared-everything nature of the DASE architecture, recovery processes are sharded across all the VAST Servers in the cluster, and VAST’s declustered approach to data protection ensures that every device holds some amount of data and parity from an arbitrary collection of stripes, making it possible to federate a reconstruction event across all of the available SSDs and all of the Server resources in a VAST Cluster. A large Universal Storage system will have dozens of VAST Servers sharing the load and shortening rebuild time.
A Fail-in-Place Cluster
To protect user data against the silent data corruption that can occur within SSDs, the VAST Element Store keeps a checksum for each data and metadata block in the system. These checksums are CRCs calculated from the data block or metadata structure’s contents and stored in the metadata structure that describes the data block. The checksum for a folder’s contents is stored as part of its parent folder, the checksum for a file extent’s contents in the extent’s metadata and so on.
When data is read, the checksum is recalculated from the data block and compared to the checksum in the metadata. If the two values are not the same, the system will rebuild the bad data block using locally decodable parity data.
Solutions that store checksums in the same block as the data, such as T10-DIF, only protect data from SSD bit rot; they do not protect the data path from internal system data transfer errors. If, in a T10-DIF system, there is as little as a 1-bit error when transmitting an LBA address to an SSD, and the SSD actually reads LBA 33,458 when we wanted the data from LBA 33,547, the system will return the wrong data: because both the data and its checksum come from LBA 33,458, they will still match.
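A small sketch shows why keeping the checksum in the parent metadata catches such a misdirected read; the CRC function and block contents are illustrative assumptions.

```python
import zlib

ssd = {
    33_458: b"data that actually lives at LBA 33458",
    33_547: b"data the application asked for at LBA 33547",
}

# The Element Store records the expected CRC in the metadata that points at the block.
extent_metadata = {"lba": 33_547, "crc": zlib.crc32(ssd[33_547])}

def read_with_verification(lba_actually_read: int) -> bytes:
    data = ssd[lba_actually_read]
    if zlib.crc32(data) != extent_metadata["crc"]:
        raise IOError("checksum mismatch: rebuild the block from parity")
    return data

read_with_verification(33_547)       # correct LBA: checksum matches
try:
    read_with_verification(33_458)   # misdirected read: caught by the parent's CRC
except IOError as err:
    print(err)
```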
To protect data being stored on the system from experiencing an accumulation of bit errors, a background process also scrubs the entire contents of the system periodically. Unlike the CRCs that are attached to VAST’s variable-length data blocks, the system creates a second set of CRCs for 1MB flash sub-strips in order to swiftly perform background scrubs. This second level of CRCs solves the problem of customers who store millions, or billions, of 1-byte files – VAST’s background scrubber can deliver consistent performance irrespective of file/object size.
Low-Overhead Snapshots and Clones
A write-in-free-space storage architecture is especially suitable for high-performance snapshots because all writes are tracked against a global counter and the system can easily maintain new and old/invalid state at the same time. VAST uses this architecture to make snapshots painless, eliminating the data and/or metadata copying that often occurs with legacy snapshot approaches.
The VAST Element Store was designed to avoid several of the mistakes of previous snapshot approaches and provides many advantages for Universal Storage users.
Instead of making clones of a volume’s metadata every time a snapshot is created, VAST snapshot technology is built deep into the metadata structure itself. Every metadata object in the VAST Element Store is time-stamped with what is called the snaptime. The snaptime is a global system counter that dates back to the installation of a VAST cluster and is advanced synchronously across all the VAST Servers in the cluster, approximately once a minute; metadata updates are timestamped with the current snaptime as they’re written.
As with data, metadata in the VAST Element Store is never directly overwritten. When an application overwrites a file or object, the new data is written to free space across the 3D XPoint write buffer. The system creates a pointer to the physical location of the new data and links that pointer into the V-Tree metadata that defines the object.
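A minimal sketch of how a snaptime-stamped, never-overwritten metadata chain makes snapshots nearly free: taking a snapshot is just remembering a counter value, and a snapshot read ignores versions newer than that value. The structures below are illustrative, not VAST's V-Tree.

```python
import itertools

snaptime = itertools.count(1)     # advanced cluster-wide, roughly once a minute
current = next(snaptime)

versions = {}                     # element -> list of (snaptime, pointer)

def write(element, pointer):
    # New data lands in free space; a new pointer stamped with the current
    # snaptime is linked in instead of overwriting the old one.
    versions.setdefault(element, []).append((current, pointer))

def read(element, as_of=None):
    # A snapshot read simply skips any version newer than the snapshot's snaptime.
    history = versions[element]
    if as_of is None:
        return history[-1][1]
    return next(p for t, p in reversed(history) if t <= as_of)

write("/genomes/sample.bam", "xpoint:0x10")
snap = current                    # "take a snapshot" = remember a snaptime
current = next(snaptime)
write("/genomes/sample.bam", "xpoint:0x7f")
print(read("/genomes/sample.bam"))               # newest data
print(read("/genomes/sample.bam", as_of=snap))   # data as of the snapshot
```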
Following VAST’s general philosophy of efficiency, VAST snapshots and clones are finer-grained than those of many conventional storage systems – allowing users to schedule, or orchestrate, snapshots of any arbitrary file system folder or S3 bucket in the cluster’s namespace without forcing artificial abstractions like RAID sets, volumes, or file systems into the picture.
A Breakthrough Approach to Data Reduction
Storage vendors realized early on that flash devices provide the benefit of random access, and random access is key to cutting up data and reducing it with classic deduplication and compression approaches. Data reduction has therefore been used to lower the effective cost of a flash GB since the first generation of true all-flash arrays. These approaches have proven effective for data that is easily compressible or for datasets that exhibit a high occurrence of exact matches, but the conventional wisdom has always been that they are ineffective for datasets with higher entropy, such as unstructured data.
VAST’s new approach to data reduction goes beyond classic data reduction techniques to provide new mechanisms that can find and isolate entropy in data at a much finer level of granularity than previously possible – to make it possible to find and reduce correlations across structured, unstructured and big data datasets.
Beyond Deduplication and Compression
Historically, applications and storage systems have employed two complementary technologies to reduce data: compression and data deduplication.
Traditional compression operates over the limited scope of a block of data or a file. The compressor identifies small, repeating data patterns, such as the nulls padding out database fields or frequently used words, and replaces those patterns with smaller symbols. Decompression reverses the process, using the data-to-symbol dictionary built during compression to replace symbols with the larger data strings. Compression, particularly with unstructured data, is often applied by the application, so storage systems are rarely able to provide additional benefit. Moreover, because the application dictates the content, application-level compression can be semantically optimized for specific file types, such as image (JPEG), video (H.264), and genome (CRAM) compression.
Data deduplication, by comparison, identifies much larger blocks of data that repeat globally across a storage namespace. Rather than storing multiple copies of the data, a deduplication system uses pointers to direct applications to a single physical instance of a deduplication block.
The coarse block hashing employed by deduplication is much more sensitive to very small differences in data, forcing a deduplication system to store a full new block even when there is just a single byte of difference between deduplication blocks. To reduce this sensitivity to entropy, deduplication systems can hash smaller block sizes – but this drives up the size of the hash index. Smaller deduplication blocks create more metadata, most significantly the hash tables that must be kept in DRAM for decent performance, and more CPU overhead. This limits the size of a deduplication realm for single-controller approaches and, more importantly, presents a law of diminishing returns, as the investment in DRAM at some point overtakes the savings delivered by a deduplication appliance.
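A back-of-envelope sketch of the DRAM problem, assuming 64 bytes of index entry per deduplication block (the per-entry cost varies by implementation; the number here is an assumption chosen only to show the trend).

```python
def index_dram_bytes(usable_capacity_tb, block_kb, entry_bytes=64):
    # Number of dedup blocks times the per-entry index cost kept in DRAM.
    blocks = usable_capacity_tb * 1e12 / (block_kb * 1024)
    return blocks * entry_bytes

# Shrinking the block size to catch smaller differences inflates the index:
for block_kb in (128, 32, 8, 4):
    gib = index_dram_bytes(1000, block_kb) / 2**30
    print(f"{block_kb:>3} KB blocks over 1 PB -> ~{gib:,.0f} GiB of hash index in DRAM")
```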
Similarity Reduction to the Rescue
While the combination of QLC flash, 10 years of system longevity, and breakthrough erasure code efficiency compounds into an economically compelling proposition for organizations looking to reduce the cost of all-flash or hybrid flash+HDD infrastructure, VAST’s new approach to global data reduction, Similarity-Based Reduction, further redefines the economics of flash. It makes it possible to build a system whose total effective capacity acquisition cost rivals or betters the economics of HDD infrastructure. The objective is simple: combine the global benefits of deduplication-based data reduction with the byte-granular pattern matching that until now has only been found in localized file or block compression.
It’s counter-intuitive, but the only way to build a system that can beat the economics of an HDD-based storage system is with commodity flash. The routines a flash system can use to cut up and correlate data for data reduction result in an extremely small on-media block size, turning every workload into an IOPS workload. Such a block size on HDDs would cause tremendous fragmentation, and HDDs simply can’t deliver enough IOPS to support fine-grained data reduction.
Similarity Reduction: How it Works
Much like traditional deduplication/backup appliances, VAST’s Similarity-Based Data Reduction starts by breaking data into coarse blocks and hashing those blocks after new writes have been persisted to the 3D XPoint write buffer. This, however, is where the parallels to deduplication appliances end, because the hash function executed by a VAST Cluster is very different from the hashing implemented by deduplication systems.
It’s a bit of an oversimplification, but Similarity Hashes can be thought of as language clues or semantic markers that suggest a high degree of correlation between different blocks. If two blocks generate similar hashes, they’re likely to reduce well when compressed with a common compression dictionary.
To create a cluster of similarity-hashed and compressed blocks, VAST Clusters compress the first block that generates a given hash value and declare that block the reference block for that hash. When another block hashes to the same value, the system compresses the new block against the reference block to create a difference (or delta) block. What remains from this global reduction method is a dictionary, stored in 3D XPoint as an attribute of the compressed reference block, and delta blocks stored as reduced symbols that can be as small as a single byte and do not require their own dictionaries. If two blocks are exactly the same, there is simply no delta to store.
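The sketch below shows the reference-plus-delta mechanism using zlib preset dictionaries. The similarity_hash stand-in (sampling every 64th byte) is purely illustrative; VAST's real hash is a proprietary function that maps similar blocks together. The restore path at the end previews the read behavior described next.

```python
import zlib

reference_blocks = {}   # similarity hash -> reference block bytes

def similarity_hash(block: bytes) -> int:
    # Stand-in only: samples every 64th byte so near-identical blocks collide.
    return zlib.crc32(block[::64])

def reduce_block(block: bytes):
    h = similarity_hash(block)
    ref = reference_blocks.get(h)
    if ref is None:
        # First block with this hash becomes the reference for that hash.
        reference_blocks[h] = block
        return ("reference", h, zlib.compress(block))
    # Later blocks are compressed *against* the reference's dictionary,
    # leaving only a small delta; identical blocks leave almost nothing.
    c = zlib.compressobj(zdict=ref)
    return ("delta", h, c.compress(block) + c.flush())

def restore_delta(h: int, delta: bytes) -> bytes:
    # A read only needs the one reference block's dictionary, not the namespace.
    d = zlib.decompressobj(zdict=reference_blocks[h])
    return d.decompress(delta) + d.flush()

ref = b"patient=ABC1234 gene=BRCA1 depth=30x quality=high " * 80
near = ref.replace(b"ABC1234", b"ABC1299", 1)      # similar but not identical
kind1, h1, blob1 = reduce_block(ref)               # stored as a reference block
kind2, h2, blob2 = reduce_block(near)              # stored as a tiny delta
print(kind1, len(blob1), kind2, len(blob2))        # delta is far smaller
print(restore_delta(h2, blob2) == near)            # True
```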
When applications read from a VAST Cluster, the Servers traverse the compression dictionaries distributed across VAST V-Trees, minimizing the scope of decompression to just a single reference block. In essence, the cluster can read 4KB objects within 1ms because it doesn’t have to decompress the whole namespace on every read, as it would if all of the data were managed in a single dictionary.
Imagine being able to zip your entire namespace and read from this namespace with millisecond latency… this is the design point for VAST’s Similarity-Based Data Reduction.
Similarity Reduction in Practice
The efficiency gained from Similarity-Based Data Reduction is, of course, data-dependent. Encrypted data will see almost no benefit from this approach, while other applications will often see significant gains. Reduction gains are also relative: where a VAST Cluster may be 2x more efficient than a legacy deduplication appliance for backup data (at, say, 15:1 reduction), a reduction of 2:1 may be just as valuable in an unstructured data environment where legacy file storage systems have never been able to demonstrate any reduction.
Some examples of VAST’s similarity reduction, derived from customer testing, include:
VASTOS: the VAST Operating System
VASTOS, the VAST Operating System, is the distributed software framework that exposes the namespace to client applications and provides scale-out cluster management services.
VASTOS Cluster Management
Just as Universal Storage was designed to redefine flash economics, VAST Clusters have been designed to minimize the cost of scale-out system operation by simplifying the full lifecycle of operation – from day-to-day management tasks such as creating NFS exports and quotas, all the way to performing automated, non-disruptive updates and expanding the cluster online.
The VAST Cluster today supports both hard and soft directory-level quotas. Any folder or S3 bucket in the system can be assigned a quota, which applies to that folder and all its contents.
VAST Clusters today support LDAP with added support for NIS environments. Active Directory support will come with the release of VAST’s SMB Protocol Server.
To resolve the 16-group limitation of NFSv3, VAST systems do not rely on the truncated list of groups communicated by the NFS client; instead, they periodically poll the directory service itself to determine each user’s full group membership.
The VASTOS Management Service
The VASTOS Management Service (also known as VMS) is responsible for all system and cluster management operations across a VAST Cluster. VMS is a highly available service that, at any given time, runs in a Docker container on one of the servers in a VAST Cluster. VMS functions include:
VMS runs in a container independent from the container that runs the VAST Server and Protocol Manager processes. The VMS container runs only management processes that take instructions from users and exchange data with the VAST Servers and Enclosures in the cluster. All VMS operations are out-of-band from the VAST Protocol Manager to ensure consistent I/O performance.
While VMS runs on a single server at a time, it is designed to be highly available. Should the server running VMS go offline, the surviving servers will detect that it isn’t responding, verify that it has truly failed, and then hold an election to assign a new host for the VMS container, which spawns a new VMS process. Since all the management service state, just like Element Store state, is stored in persistent 3D XPoint, VMS picks up right where it left off on the failed host.
VMS polls all the systems in a VAST Cluster every 10 seconds, collecting hundreds of performance, capacity, and system health metrics at each interval. These metrics are stored in the VMS database, where they serve as the source for the VAST GUI’s dashboard and the other in-depth analytics we examine in more detail below.
The system allows an administrator to examine analytics data over the past year without consuming an unreasonable amount of space by consolidating samples as they age, reducing granularity from a sample every 10 seconds to one sample per hour for data more than 30 days old.
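A minimal sketch of that age-based consolidation: samples older than 30 days are averaged into hourly points, so a year of history stays small. The schema and averaging policy here are assumptions for illustration, not the actual VMS database design.

```python
from statistics import mean

TEN_SEC, HOUR, THIRTY_DAYS = 10, 3600, 30 * 86400

def consolidate(samples, now):
    """samples: list of (timestamp, value) collected every 10 seconds."""
    recent = [(t, v) for t, v in samples if now - t <= THIRTY_DAYS]
    old = [(t, v) for t, v in samples if now - t > THIRTY_DAYS]
    hourly = {}
    for t, v in old:
        hourly.setdefault(t - t % HOUR, []).append(v)
    # One averaged point per hour for aged data, full resolution for recent data.
    return sorted((t, mean(vs)) for t, vs in hourly.items()) + recent

now = 40 * 86400
samples = [(t, 1.0) for t in range(0, now, TEN_SEC)]
print(len(samples), "->", len(consolidate(samples, now)))
```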
VAST Insight (VAST’s Remote Call Home Service)
In addition to saving those hundreds of system metrics to their local databases, VAST clusters also send encrypted and anonymized analytics data to VAST Insight, a proactive remote monitoring service managed by the VAST Customer Support team. VAST Insight is opt-in; customers with strong security mandates are not required to run Insight to receive VAST support.
VAST’s support and engineering teams use this platform to analyze and support many aspects of its global installed base, including:
Insight also provides VAST’s R&D team invaluable insight into how customers actually use Universal Storage Clusters, so we can concentrate our engineering toward efforts that will have the greatest positive impact on customers.
Non-Disruptive Cluster Upgrades
Too many storage technologies today still have to be shut down to update their software. As a result, storage administrators (who perform these outages during off-hours and weekends) optimize their work/life schedules by architecting storage estates around the downtime events their systems impose on them. The result is many smaller systems, deployed to limit the scale of the outages they have to endure. Administrators also avoid the perils of upgrade-driven downtime by delaying system updates, which has the unfortunate side effect of leaving systems exposed to vulnerabilities their vendors have already fixed in more current code branches.
VAST Data believes that disruptive updates are antithetical to the whole concept of Universal Storage. A multi-purpose, highly-scalable storage system has to be universally and continually available through routine maintenance events such as system updates.
The cluster upgrade process is completely automated: when a user specifies the update package to be installed, VMS does the rest. The statelessness of the VAST Servers also plays an outsized role in making cluster updates simple, as no system state needs to be taken offline to update any one computer. To perform the upgrade, VMS selects a VAST Server in the cluster, fails that Server’s VIPs (Virtual IP Addresses) over to other VAST Servers in the cluster, and then updates the VAST Server Container (a Docker container holding a logical VAST Server) and, in the case of an OS update, the VASTOS Linux image on the host. VMS then repeats this process, transferring VIPs back to the updated servers as they reboot, until all the VAST Servers in the cluster are updated.
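A minimal sketch of that rolling sequence. The data structures and helper logic are illustrative stand-ins; the real orchestration is performed by VMS, not by user scripts.

```python
cluster = {
    "cnode1": {"vips": ["10.0.0.11"], "version": "3.x"},
    "cnode2": {"vips": ["10.0.0.12"], "version": "3.x"},
    "cnode3": {"vips": ["10.0.0.13"], "version": "3.x"},
}

def rolling_upgrade(cluster, package):
    for name, server in cluster.items():
        peers = [s for n, s in cluster.items() if n != name]
        moved, server["vips"] = server["vips"], []        # fail this server's VIPs over
        peers[0]["vips"] += moved                         # clients keep their connections
        server["version"] = package                       # update the VAST Server container
                                                          # (and VASTOS image, if included)
        peers[0]["vips"] = [v for v in peers[0]["vips"] if v not in moved]
        server["vips"] = moved                            # take the VIPs back once healthy
    return cluster

print(rolling_upgrade(cluster, "4.x"))
```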
Updating the VAST Enclosure follows a similar process. VMS instructs the enclosure to reprogram its PCIe switches to connect all the SSDs to one Fabric Module, and all the VAST Servers then reach those SSDs through the still-online Fabric Module. Once the failover has completed, VMS updates the Fabric Module’s software and resets the enclosure to bring it back online as an HA pair.
The ability to scale storage performance, via VAST Servers, independently from Enclosure capacity is one of the key advantages of the DASE architecture. Users can add VAST Servers and/or VAST Enclosures to a cluster at any time.
As we saw in the Asymmetric Scaling section, VAST Servers and Enclosures can be composed of heterogeneous infrastructure as a system scales and evolves over time. VAST Servers of different generations (different makes of CPUs, different core counts) and different VAST Enclosures (differing numbers and sizes of SSDs) can all be members of the same cluster without imposing any boundaries around datasets or introducing performance variance.
When VAST Servers are added to a cluster or Server Pool, they’re assigned a subset of the cluster’s VIPs and immediately start processing client requests on those VIPs, boosting system performance. Thanks to the containerized packaging of VAST software, users can orchestrate the process of adding VAST Servers to accommodate expected short-term demand, such as a periodic data load, and release the hosts to other uses at the end of the peak demand period.
When enclosures are added to a VAST Cluster, the system immediately starts using the new Optane and QLC flash to store new user data and metadata, providing a linear boost to the cluster’s performance.
I/O and pre-existing data are rebalanced across the expanded capacity of the system.
Twenty-first century data centers should be managed not by a high priesthood of CLI bashers who are charged with maintaining farms of data storage silos, but by orchestration platforms that manage individual devices not through a CLI but through more structured, consistent, and accessible application program interfaces (APIs).
VMS is built with an API-first design ethos. All VMS management functions on the cluster, from creating NFS exports to expanding the cluster, are exposed through a RESTful API. To take the complexity out of learning the API and writing against it from various programming languages, VMS publishes its APIs via Swagger. For the uninitiated, Swagger is an open-source framework that provides tools to build, document, and consume RESTful web services.
While the VMS also provides a GUI (details below) and a traditional CLI, VMS’s API-first design means the GUI and CLI consume the RESTful API rather than controlling the system directly. This approach ensures that API calls are sufficiently tested and that all system functions will always be available through the RESTful API. Systems with CLI- or GUI-first design philosophies can often treat their RESTful APIs as second class citizens.
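A short, hedged sketch of what consuming an API-first system looks like from a script. The endpoint paths, payload fields, and authentication shown here are hypothetical placeholders; the actual routes and parameters are published by the Swagger schema on a real cluster.

```python
import requests

VMS = "https://vms.example.internal/api"     # hypothetical VMS address
session = requests.Session()
session.auth = ("admin", "password")         # placeholder credentials
session.verify = False                       # lab-only; use the cluster's CA in production

# Hypothetical call: create an NFS export (a "view") for a folder in the namespace.
resp = session.post(f"{VMS}/views/", json={"path": "/genomics", "protocols": ["NFS"]})
resp.raise_for_status()

# Hypothetical call: apply a hard quota to the same folder.
session.post(f"{VMS}/quotas/", json={"path": "/genomics", "hard_limit": 500 * 2**40})
```

Because the GUI and CLI consume the same API, anything scripted this way behaves identically to the same action taken interactively.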
A Modern GUI
While a RESTful API simplifies automating common tasks, GUIs remain the management interface of choice for customers who want a quick and simple view of system operations.
A good dashboard allows mere mortals to understand the health of their storage system in seconds, while a full-featured GUI makes it easy to perform simple tasks and lets more curious parties drill down into the data that feeds the dashboard.
The VAST web GUI is implemented entirely in HTML5 and does not require Java, Adobe Flash, browser plug-ins or any other client software. Administrators can manage their VAST Clusters from any modern browser.
Viewing Analytics Data
The VAST GUI’s main dashboard provides a system administrator, or any other user who has been granted RBAC permissions, a quick snapshot of the system’s health. Detailed analytics are also provided to understand what’s happening at both the system level and the application level.
VMS’s Top-Actors data allows administrators to reverse-engineer application performance issues by helping them understand how top users, exports and clients are interacting with the system. Top-Actors provides the ability to monitor the top users of the system in real time.
For historical data analysis, the VAST Analytics dashboard provides administrators with a rich set of metrics to monitor everything from latency, IOPS, throughput, and capacity down to the component level.
Administrators can even create and save customized dashboards by combining any number of metrics they find useful in order to determine event correlations.
Role-Based Access Controls
Just as large storage systems need quotas to manage capacity across a large number of users, Universal Storage features role-based access control (RBAC) to enable multiple system tenants to work without gaining access to information that is beyond their purview.
A global administrator can assign read, edit, create and delete permissions to multiple areas of VAST system management, and establish customized sets of permissions as pre-defined roles that can be applied to classes of resources and users.
Options for defining RBAC controls in a VAST Cluster
A Realtime Operating System in Userspace
VASTOS is packaged in Docker containers and distributed across Linux cores as a sort of real-time userspace operating system. The real-time capabilities come from how the system manages all of the fibers in a container. VASTOS manages processes at the fiber level because, unlike threads, fibers are lightweight threads of execution that do not depend on the kernel’s thread scheduler; instead, fibers cooperatively schedule one another as they execute and are therefore less prone to interruption from normal kernel housekeeping. The result of managing CPUs at the fiber level is that the system avoids jitter and delivers very predictable time-to-first-byte latency.
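Python coroutines are only a loose analogy for fibers, but they illustrate the key idea: each task yields control at explicit points instead of being preempted by the kernel's thread scheduler, which is what keeps latency predictable. This is an analogy only, not how VASTOS is implemented internally.

```python
import asyncio

async def fiber(name, requests):
    for req in requests:
        # ... do a bounded unit of work for this request ...
        print(f"{name}: handled {req}")
        await asyncio.sleep(0)    # explicit yield point to the other fibers

async def main():
    # Two "fibers" interleave cooperatively on one thread, with no kernel preemption.
    await asyncio.gather(fiber("read-fiber", ["r1", "r2"]),
                         fiber("write-fiber", ["w1", "w2"]))

asyncio.run(main())
```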
VASTOS: Multi-Protocol Support
While the VAST Element Store manages a VAST Cluster’s media and data, there’s more to a storage system than just storing data. The system also needs to make that data available to users and applications, define and enforce access security, and enable cluster administration. VASTOS, or more specifically the VASTOS Protocol Manager, does just that.
Just as the VAST Element Store is a namespace abstraction that combines the best parts of a file system with the scale and extended metadata of an object store, the VASTOS Protocol Manager provides a protocol-independent interface to the individual elements in the Element Store. All supported protocol modules are considered equals, eliminating the need for gateways, data silos, and other protocol conversion hacks. The namespace abstraction provided by the Element Store enables VAST Data to add support for additional file, block, big data, and yet-to-be-invented protocols over time simply by adding protocol modules.
The individual elements in the VAST Element Store are also protocol-independent: all elements, and the entire capacity of the VAST cluster, can be made accessible through any supported protocol. This allows users to access the same elements over multiple protocols. As an example of multi-protocol access in practice (sketched below): a gene sequencer can store genomic data via NFS, the same application writing data over NFS can enrich that write with additional metadata via an S3 API call, and that data can then be made available via S3 to a pipeline built on a cloud framework.
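A hedged sketch of that workflow, assuming the same VAST view is exported as an NFS mount at /mnt/vast and as an S3 bucket named "genomics". The mount point, bucket name, endpoint, and credentials are hypothetical placeholders.

```python
import boto3

# 1) The sequencer writes the file over NFS, like any POSIX application.
with open("/mnt/vast/run42/sample.bam", "wb") as f:
    f.write(b"...BAM payload...")

# 2) Another process enriches the same element with user-defined metadata via S3.
s3 = boto3.client("s3", endpoint_url="https://vast.example.internal",
                  aws_access_key_id="KEY", aws_secret_access_key="SECRET")
s3.copy_object(Bucket="genomics", Key="run42/sample.bam",
               CopySource={"Bucket": "genomics", "Key": "run42/sample.bam"},
               Metadata={"sequencer": "novaseq-07", "run": "42"},
               MetadataDirective="REPLACE")

# 3) A cloud-framework pipeline later reads the object and its metadata over S3.
obj = s3.get_object(Bucket="genomics", Key="run42/sample.bam")
print(obj["Metadata"], len(obj["Body"].read()))
```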
VASTOS 2.0 supports the most popular file protocol in the open systems world, NFS (Network File System), and the de facto standard for object storage, S3.
VAST NFS – Parallel File System Speed, NAS Simplicity
VASTOS today supports NFS v3, along with several important extensions to NFS v3:
In the mid-2010s, Oracle and a number of other key contributors extended NFS networking support with extensions for RDMA transport of NFS RPCs between client and server. Thanks to their work, all contemporary Linux distributions now include support for NFSoRDMA. NFSoRDMA breaks the long-standing TCP bottleneck by replacing TCP as the transport protocol underlying NFS with RDMA (Remote Direct Memory Access) verbs. NFSoRDMA can, of course, run over InfiniBand networks that have long supported RDMA, but it can also run over standard Ethernet on any network that supports UDP and ECN, on which organizations can run v2 of RDMA over Converged Ethernet (RoCEv2). Since RDMA verbs bypass the kernel, a single NFS session to a VAST Cluster can achieve up to 70% of a 100 Gbps network’s line speed, or 8.8 GB/s, and proportionally more with 200 Gbps of client-to-VAST network bandwidth.
The beauty of NFSoRDMA lies in its simplicity of deployment: the NFSoRDMA client doesn’t require kernel patches or client agents that can make deploying and maintaining high-bandwidth parallel file system storage problematic. Since RDMA verbs are implemented in the rNIC (RDMA Network Interface Card), NFSoRDMA also uses fewer host CPU resources, leaving more compute power for user applications.
VAST S3 – Object Storage for Modern Applications
Amazon’s S3 protocol, or more accurately the protocol used by Amazon’s Simple Storage Service (S3), has emerged as the de facto standard for object storage, in no small part because it allows developers to support both on-premises object stores like VAST’s Universal Storage and cloud storage such as Amazon S3 and its competitors.
VAST’s Protocol Manager exposes S3 objects using HTTP GET and PUT semantics, transferring an entire object in a single request across a flat namespace. Each object is identified and accessed by a URI (Uniform Resource Identifier), which specifies both the location of a resource and the protocol used to access it. While the URIs identifying objects can contain slashes (/) like file system paths, object stores don’t treat the slash as a special character; slashes are just another character in an internal element identifier, which lets an object store emulate a folder hierarchy without the complexity a real hierarchy creates.
VAST Objects are stored in a similar manner to files, with the difference being that an S3 object includes not just its contents, but also user-defined metadata that allows applications to embed their metadata about objects within the objects themselves.
While object storage has classically been used for archival purposes, the emergence of fast object access and, in particular, all-flash object storage has extended the use cases for which object storage is appropriate. For example, many Massively Parallel Processing (MPP) and NoSQL databases use object storage as their underlying data repository.
VAST Universal Storage systems support a subset of the S3 verbs offered by Amazon’s S3 service. Whereas many of Amazon’s APIs are specific to their service offering, VAST Clusters expose the S3 verbs required by most applications that benefit from an all-flash, on-premises object store. That excludes tiering (a VAST system provides one tier: all-flash) and AWS-specific factors, such as the number of availability zones each object should be replicated across.
As with VAST’s NFS offering – S3 performance scales as Enclosure and Server infrastructure is added to a cluster. For example, a Cluster consisting of 10 enclosures can be read from at a rate of up to 230GB/s for 1MB GET operations, or at a rate of 730K random 4KB GETs per second.
The last step in improving the economics of an all-flash storage system is extending its lifecycle. Storage systems have historically been designed for a 3-5 year lifecycle, and storage vendors enforce this by inflating the cost of support in later years, making it more economical to replace a storage system than to renew its support contract. These same vendors then get drunk on the refresh-cycle sale, and their investors are conditioned to the 3-5 year revenue cycle that comes from this refresh selling motion.
VAST storage systems are designed to support a 10-year deployment lifecycle. VAST support costs are both low and guaranteed not to rise for a full decade. Users can choose to retire VAST Servers and VAST Enclosures as they add newer, undoubtedly denser and more powerful, models to a cluster based on their business requirements, not because vendors leverage rising support costs against them. Moreover, with the ability to add an arbitrary number of controllers to accelerate access to a cluster, there’s no need to be enchanted by expensive evergreen storage controller refresh models.
While VAST Servers and VAST Enclosures have a 10-year lifetime, a VAST cluster lives for as long as a user wants. New Servers and Enclosures are added to clusters as performance and/or capacity is needed, and older models are aged out as their utility diminishes. While the hardware evolves over time, the cluster lives through a succession of expansion and decommission events without ever requiring users to forklift-upgrade anything.
Universal Storage frees users from the tyranny of storage tiers with a single storage solution that provides the best capabilities of each legacy storage tier, and virtually none of the compromises:
… all at a cost that is affordable for all of an organization’s applications.
By replacing this multitude of storage tiers and bringing an end to the HDD era, Universal Storage eliminates the cost and effort of constantly migrating data between tiers and the waste of managing data across multiple infrastructure silos at each tier. Even more significantly, Universal Storage helps realize the promise of the true all-flash data center, so organizations can analyze, and extract business value from, data they would otherwise have relegated to slow archive storage that cannot support the needs of new data-intensive applications.
DASE: The Universal Storage Architecture
To build Universal Storage and bring down the cost of all-flash storage, VAST Data had to build a new scale-out storage architecture: DASE (Disaggregated Shared Everything). DASE in turn is empowered by three new technologies:
DASE disaggregates the storage media (3D XPoint and QLC flash SSDs in highly available enclosures) from the VAST Servers (stateless software controllers that run in containers on commodity x86 servers and appliances).
DASE provides several advantages over conventional storage, including other scale-out systems:
DF-5615 NVMe Enclosure
VAST Quad Server Chassis
©2020 Vast Data, Inc. All rights reserved. All trademarks, service marks, trade names, product names and logos appearing on this page are the property of their respective owners.