Last Updated: July 16, 2020
VAST Data’s Universal Storage redefines the economics of flash storage, making flash affordable, for the first time, for all applications: from the highest-performance databases to the largest data archives. The Universal Storage concept blends game-changing storage innovations that lower the acquisition cost of flash with an exabyte-scale file and object storage architecture that breaks decades of storage tradeoffs.
With the advantage of new enabling technologies that weren’t available before 2018, the Universal Storage concept achieves a previously impossible architectural design point. The system combines low-cost QLC flash drives and 3D XPoint memory (such as Intel Optane) with stateless, containerized storage services, all connected over new low-latency NVMe over Fabrics networks, to create VAST’s Disaggregated Shared Everything (DASE) scale-out architecture. Next-generation global algorithms are applied to this DASE architecture to deliver new levels of storage efficiency, resilience, and scale.
While the architecture concepts are sophisticated, the intent and vision of Universal Storage are simple: to bring an end to the data center HDD era and to the complexity of storage tiering, a byproduct of decades of compromises forced by mechanical media. This White Paper introduces VAST Data’s Universal Storage and the DASE architecture, and explains how this new architecture defies conventional definitions of storage. By breaking the classic price/performance tradeoff, the system delivers all-flash performance at archive economics, simplifying the data center and accelerating all modern applications.
Why Universal Storage?
The Tyranny of Tiers
Over 30 years ago, Gartner introduced the storage tiering model as a means to optimize data center costs, advising customers to demote older, less-valuable data to lower-cost (and slower) tiers of storage. Fast forward 30 years, and the sprawl of storage technologies within organizations has grown to unmanageable proportions: many of the world’s largest companies can be found managing dozens of different types of storage. The problem spans both storage classes (for example: all-flash, hybrid, all-HDD, tape) and protocol classes (block, file, object, big data, etc.), and together it all creates a complex pyramid of storage technologies.
While the savings are clear when applying this model with legacy storage architectures, the idea that data should exist on a specific storage tier according to its current value creates multiple challenges:
The Demands of Artificial Intelligence Render Storage Tiering Obsolete
Arguably the greater problem with storage tiering is that the concept assumes the applications accessing data have a narrow, predefined view of their data access requirements. While that is true for some applications, such as traditional database engines, new game-changing AI and analytics tools, such as machine learning and deep learning, see value in all data and want the fastest access to the largest amounts of data. For example, when a deep learning system trains its neural network model for facial recognition, the model only becomes more accurate once it has been run against all the photos in the dataset, not just the 15-30% that may fit in some expensive flash tier. The value these applications deliver is proportional to the corpus of data they are exposed to; they thrive on large data sets.
Defining Universal Storage
Universal Storage is a next-generation, scale-out file and object storage concept that breaks decades of storage tradeoffs, and in so doing defies classical storage definitions. Universal Storage is:
New Technologies Lay A New Storage Foundation
There are points in time when the introduction of new technologies makes it possible to rethink fundamental approaches to system architecture. To realize the Universal Storage architecture vision, VAST made a bet on a trio of underlying technologies that were not available to previous storage architecture efforts and, in fact, only became commercially viable together in 2018. These are:
For the first time in 30 years, a new type of media has been introduced into the classic media hierarchy. 3D XPoint is a persistent memory technology that offers both lower latency and greater endurance than the NAND flash memory used in SSDs while, like flash, retaining data persistently without external power.
Universal Storage systems use 3D XPoint both as a high-performance write buffer, which enables the use of low-cost QLC flash for the system’s data store, and as a global metadata store. 3D XPoint was selected for its low write latency and high endurance. A Universal Storage cluster includes tens to hundreds of terabytes of 3D XPoint capacity, which provides the VAST DASE architecture with several architectural benefits:
NVMe over Fabrics
NVMe (Non-Volatile Memory express) is the software interface that replaced the SCSI command set for accessing PCIe SSDs. Greater parallelism and lower command queue overhead make NVMe SSDs significantly faster than their SAS or SATA equivalents.
NVMe over Fabrics (NVMe-oF) extends the NVMe API over commodity Ethernet and InfiniBand networks to provide PCIe levels of performance for remote storage access at data center scale. VAST’s DASE architecture disaggregates CPUs from media, connecting them to a globally accessible pool of 3D XPoint and QLC flash SSDs. This enables a system architecture that scales controllers independently of storage and provides the foundation for a new class of global storage algorithms designed to drive the effective cost of the system below the sum of its cost of goods. With NVMe-oF, VAST Containers enjoy the advantages of statelessness and shared-everything access to a global pool of 3D XPoint and flash, with direct-attached levels of storage access performance.
The logic of VAST’s Universal Storage cluster runs in stateless containers. Thanks to NVMe-oF and NVMe Flash and 3D XPoint, each container enjoys direct-attached levels of storage performance without having any direct-attached stateful storage. Containers make it simple to deploy and scale VAST as a software-defined microservice while also laying the foundation for a much more resilient architecture where container failures are non-disruptive to system operation.
Quad-Level Cell (QLC) flash is the fourth and latest generation in flash memory density and therefore costs the least to manufacture. Each cell in a QLC flash chip stores four bits, which requires distinguishing 16 different voltage levels; QLC thus stores 33% more data in the same space than Triple-Level Cell (TLC) flash.
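The bits-per-cell arithmetic above can be checked directly: each additional bit per cell doubles the number of voltage levels the cell must resolve, while adding a progressively smaller fraction of extra capacity. A quick illustrative calculation:

```python
# Illustrative arithmetic for flash density generations:
# an n-bit cell must distinguish 2**n voltage levels.
generations = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4, "PLC": 5}

for name, bits in generations.items():
    levels = 2 ** bits
    print(f"{name}: {bits} bits/cell -> {levels} voltage levels")

# Density gain of QLC over TLC: one extra bit on top of three.
gain = (4 - 3) / 3
print(f"QLC stores {gain:.0%} more data per cell than TLC")  # 33%
```

The same arithmetic explains the diminishing returns of PLC: its fifth bit adds only 25% more capacity yet doubles the voltage levels to 32.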
While QLC brings the cost per GB of flash down to unprecedentedly low levels, squeezing more bits in each cell comes with a cost. As each successive generation of flash chips reduced cost by fitting more bits in a cell, each generation also had lower endurance, wearing out after fewer write/erase cycles. The differences in endurance across flash generations are huge – while the first generation of NAND (SLC) could be overwritten 100,000 times, QLC endurance is 100x lower.
Erasing flash memory requires a high voltage that physically damages the flash cell’s insulating layer. After many cycles, enough damage accumulates to allow some electron leakage through the silicon’s insulating layer. Since the differences between voltage levels in a QLC flash cell are 1/16th (or less) as large as in an SLC cell, there is simply less real estate for insulation; it is therefore easier for leaking electrons to change the state of a QLC cell from one value to another, hence QLC’s lower endurance.
Several flash vendors have now started talking about an even denser, 5-bit-per-cell flash generation. While designs are preliminary, this new PLC (Penta-Level Cell) flash will store 25% more data per cell than QLC and is projected to endure only a few hundred write/erase cycles per cell, which the VAST architecture is also designed to accommodate.
VAST’s Universal Storage systems were designed to minimize flash wear in two ways: innovative data structures that align with the internal geometry of low-cost QLC SSDs in ways never before attempted, and a large 3D XPoint write buffer that absorbs writes, providing the time and space to minimize flash wear. The combination allows VAST Data to warranty QLC or PLC flash systems for 10 years, which has its own impact on system ownership economics.
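The wear-reduction effect of a write buffer can be sketched in miniature. The model below is a hypothetical illustration, not VAST’s implementation: small incoming writes accumulate in a persistent staging area (standing in for 3D XPoint) and reach the QLC tier only as large stripes aligned to an assumed flush unit, so the flash sees a few big sequential writes instead of thousands of small ones.

```python
# Hypothetical sketch of write buffering to reduce QLC wear.
# STRIPE_SIZE is an illustrative flush unit, assumed to align with
# the SSD's internal erase-block geometry.
STRIPE_SIZE = 1 << 20  # 1 MiB

class WriteBuffer:
    def __init__(self):
        self.staged = bytearray()   # stands in for the 3D XPoint buffer
        self.flushed_stripes = []   # stands in for QLC flash

    def write(self, data: bytes):
        self.staged.extend(data)
        # Flush only full, aligned stripes as large sequential writes.
        while len(self.staged) >= STRIPE_SIZE:
            stripe = bytes(self.staged[:STRIPE_SIZE])
            del self.staged[:STRIPE_SIZE]
            self.flushed_stripes.append(stripe)

buf = WriteBuffer()
for _ in range(3000):
    buf.write(b"x" * 1024)         # 3000 small 1 KiB client writes
print(len(buf.flushed_stripes))    # only 2 large stripes reach flash
```

The tail of the data stays in the durable buffer until a full stripe accumulates, which is what gives the system time to place data wear-efficiently.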
Scale-out Beyond Shared Nothing
For the past 10 years, the storage industry has convinced itself that a shared-nothing storage architecture is the best approach to achieving storage scale and cost savings. Following the release of the Google File System architecture whitepaper in 2003, it became table stakes for storage architectures of almost every variety to be built on a shared-nothing model, from hyper-converged storage to scale-out file storage, object storage, data warehouse systems, and beyond. Today, the basic principles that shared-nothing systems were founded on are much less valid, for the following reasons:
The DASE Architecture
VAST Universal Storage is based on a new scale-out architecture concept consisting of two building blocks that are scaled across a common NVMe fabric. First, the state (and storage capacity) of the system is held in resilient, high-density NVMe-oF storage enclosures. Second, the logic of the system is implemented by stateless Docker containers, each of which can connect to and manage all of the media in the enclosures. Since the compute elements are disaggregated from the media across a data-center-scale fabric, each can scale independently, thereby decoupling capacity and performance.
In this Disaggregated Shared Everything (DASE) architecture, every VAST Server in the cluster has direct access to all the cluster’s storage media at PCIe levels of low latency.
VAST Servers provide the intelligence to transform enclosures full of 3D XPoint and QLC SSDs into an enterprise storage cluster. VAST Servers serve file and object protocol requests from NFS, S3, and SMB clients and manage the global namespace, called the VAST Element Store.
The VAST Server Operating System (VASTOS) provides multi-protocol access to the VAST Element Store by treating file and object protocols as interchangeable peers. Clients can write a file to an NFS mount or an SMB share and read the same data as an object from an S3 bucket (and vice versa). Today, VASTOS supports NFS v3, including NFSoRDMA (NFS over RDMA), and SMB (Server Message Block, the Microsoft protocol previously known as CIFS) file protocols, along with the de facto cloud-standard S3 object storage protocol. Each server manages a collection of virtual IP addresses (VIPs) that clients mount via round-robin DNS services to balance load across the cluster.
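The round-robin DNS load-balancing pattern described above can be sketched in a few lines. This is a generic illustration of the technique, not VAST-specific code: each DNS lookup of the cluster name hands back the next VIP in rotation, so successive client mounts spread evenly across the servers.

```python
from itertools import cycle

# Illustrative VIP addresses; a real cluster would publish its own.
vips = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

# Round-robin DNS: each lookup returns the next address in rotation.
dns_round_robin = cycle(vips)

# Six clients mounting the cluster land evenly across the three VIPs.
mounts = [next(dns_round_robin) for _ in range(6)]
print(mounts)  # each VIP is handed out exactly twice
```

Because the servers are stateless, it does not matter which VIP a given client lands on; any server can service any request.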
All the VAST Servers in a cluster mount all the storage devices in the cluster via NVMe-oF, providing global and direct access to all the data and metadata in the system. With this global view, VASTOS distributes data management services (erasure encoding, data reduction, etc.) across the cluster’s CPUs so that cluster performance scales linearly as more CPUs are added.
VASTOS is deployed in stateless Docker containers to simplify software updates and cluster management across the VAST Server appliances. Containerizing VASTOS abstracts the cluster’s logic from the underlying server hardware; as a result, the DASE cluster is designed to support flexible deployment models, including:
Today, options 2 and 3 require VAST to qualify and approve a customer’s operating environment.
The Advantage of a Stateless Design
When a VAST Server receives a read request, the Server accesses persistent metadata housed in 3D XPoint across the fabric to locate a file or object’s data, then reads the data from QLC flash (or from XPoint, if the data has not yet been migrated out of the buffer) before forwarding it to the client. For write requests, the VAST Server writes both data and metadata directly to multiple XPoint SSDs before acknowledging the write. This direct access to shared devices over an ultra-low-latency fabric eliminates the need for VAST Servers to talk to each other to service an IO request: no machine talks to any other machine in the synchronous write or read path. Shared-everything makes it easy to scale performance linearly just by adding CPUs, thereby overcoming the law of diminishing returns often seen when shared-nothing architectures are scaled up. Clusters can be built from thousands of VAST Servers to provide extreme levels of aggregate performance; the only scalability limiter is the size of the fabric that customers configure.
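The write path described above can be modeled abstractly. The sketch below is a simplified, hypothetical illustration (device and function names are invented for the example): a server persists the data and its metadata to several shared, durable devices over the fabric and only then acknowledges, with no server-to-server communication anywhere in the path.

```python
# Hypothetical model of a stateless, shared-everything write path.
class XPointDevice:
    """Stands in for a shared 3D XPoint SSD reachable over NVMe-oF."""
    def __init__(self):
        self.blocks = {}

    def write(self, key, value):
        self.blocks[key] = value  # stands in for a remote NVMe write

def handle_write(devices, key, data, copies=3):
    record = {"data": data, "meta": {"len": len(data)}}
    for dev in devices[:copies]:   # mirror to multiple XPoint SSDs
        dev.write(key, record)
    return "ACK"                   # acknowledge only after persistence

fabric = [XPointDevice() for _ in range(4)]
assert handle_write(fabric, "/a/file", b"hello") == "ACK"

# Any server can now read the data and metadata directly from the
# shared devices; no coordination with the writing server is needed.
print(fabric[0].blocks["/a/file"]["meta"]["len"])  # 5
```

The key property is that all state lands on shared durable media before the acknowledgment, so a server crash after the ACK loses nothing and any surviving server can pick up where it left off.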
Storing all the system’s metadata in shared, persistent XPoint SSDs eliminates the need to maintain any coherency between Servers and eliminates the need for power failure protection hardware that would be otherwise required by volatile and expensive DRAM write-back caches. VAST’s DASE architecture pairs 100% nonvolatile media with transactional storage semantics to ensure that updates to the Element Store are always consistent and persistent.
VAST Servers do not themselves maintain any local state, making it easy to scale services and fail over around any Server outage. When a VAST Server joins a cluster, it executes a consistent hashing function to locate the roots of the various metadata trees. As Server resources are added, the cluster leader rebalances responsibility for shared functions. Should a Server go offline, other Servers easily adopt its VIPs, and clients reconnect to the new Servers within standard timeout ranges upon retry.
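Consistent hashing, mentioned above, is a standard technique; the sketch below shows the general idea rather than VAST’s actual function. Keys and servers hash onto a ring, and each key is owned by the next server clockwise, so adding or removing a server moves only a small fraction of assignments.

```python
import hashlib
from bisect import bisect

# Generic consistent-hashing sketch (illustrative, not VAST's code).
def ring_hash(value: str) -> int:
    """Map a string to a point on the hash ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(servers):
    """Place each server at its hash point, sorted around the ring."""
    return sorted((ring_hash(s), s) for s in servers)

def owner(ring, key: str) -> str:
    """The server at the next ring point clockwise owns the key."""
    points = [p for p, _ in ring]
    idx = bisect(points, ring_hash(key)) % len(ring)
    return ring[idx][1]

ring = build_ring([f"server-{i}" for i in range(4)])
print(owner(ring, "metadata-tree-root-7"))  # deterministic owner
```

Because ownership is a pure function of the key and the current membership, every stateless server computes the same answer independently, with no coordination traffic.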
This shared-everything cluster concept breaks the rigid association that storage systems have historically built around specific storage devices within a cluster. A VAST Cluster will continue to operate and provide all data services even with just one VAST Server running, since all state is stored in a set of globally accessible and resilient storage enclosures. If, for example, a cluster consisted of 100 Servers, the cluster could lose as many as 99 machines and still be 100% online.
VAST Enclosures are resilient, NVMe-oF storage enclosures that connect XPoint and QLC flash SSDs to a high-throughput Ethernet or InfiniBand network. The VAST Enclosure features no single point of failure: fabric modules, NICs, fans, and power supplies are all fully redundant, so VAST Clusters can be built from as few as one Enclosure and scale to 1,000 Enclosures.
Per the figure above, each VAST Enclosure houses two Fabric Modules that are responsible for routing NVMe-oF requests from Ethernet or InfiniBand ports to the Enclosure’s SSDs through a complex of PCIe switch chips. With no single point of failure from network port to SSD, VAST Enclosures combine enterprise-grade resiliency with high-throughput connectivity. While at face value the architecture of a VAST Enclosure appears similar to a dual-controller storage array, there are in reality several fundamental differences:
Fabric Failover With Single Ported SSDs
Server Pooling and Quality of Service
The VAST Servers in a cluster can be subdivided into Server Pools that create isolated failover domains. This makes it possible to dedicate an arbitrarily-sized pool of servers to a set of users or applications, isolating application traffic and ensuring a quality of ingress and egress performance that is not possible in shared-nothing or shared-disk architectures.