The Rule of Nines

Authored by

Howard Marks, VAST Technologist Extraordinary and Plenipotentiary

This blog post was written in 2022 and reflects product capabilities at that time. Some information may be outdated.

As storage administrators and storage vendors, we are the custodians of the world’s data. As a result, when we talk about storage reliability, the conversation generally revolves around data durability, a subject we’ve talked about before. While not losing a customer’s data is the storage system’s most important task, to be useful a storage system must not only store data but also make it available for access and processing.

In theory, VAST systems are designed for 100% reliability; in practice, we haven’t quite been able to reach that goal. Over the past 180 days, the 150-odd VAST clusters that send telemetry to VAST Insight reported 99.9995% availability.

But what does five, six, or even seven nines of reliability really mean to you, the storage operator? For a single system, five nines of uptime translates to just over five minutes of downtime a year. Each additional 9 represents a 10-fold increase in reliability, and therefore 1/10th the allowed downtime, as reliability asymptotically approaches 100%.

Availability	Downtime / 6 months
99.999%	2.6 minutes
99.9999%	15.7 seconds
99.99999%	1.57 seconds

Single System Availability and Downtime
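
The numbers in the table are just the length of the period multiplied by the fraction of time a system is allowed to be down. Here is a minimal Python sketch of that arithmetic, assuming the 4380-hour six-month window used in this post; the function name `allowed_downtime_seconds` is ours, purely for illustration.

```python
HOURS_PER_SIX_MONTHS = 4380  # six-month window used in this post (8760 / 2)


def allowed_downtime_seconds(availability_pct, period_hours=HOURS_PER_SIX_MONTHS):
    """Return the downtime, in seconds, a single system may accrue in the period."""
    unavailable_fraction = 1 - availability_pct / 100
    return period_hours * 3600 * unavailable_fraction


# Five, six, and seven nines over six months.
for pct in (99.999, 99.9999, 99.99999):
    seconds = allowed_downtime_seconds(pct)
    print(f"{pct}% -> {seconds:.2f} s ({seconds / 60:.2f} min)")
```

Passing `period_hours=8760` gives the full-year figure instead, which for five nines works out to a little over five minutes.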

For a larger pool of systems, like the VAST installed base, we calculate availability as one minus the ratio of total downtime hours across all VAST clusters to the total number of cluster-hours (4380 hours per cluster for our six-month sample):

99.9995% = 1 - Sum(Hours cluster unavailable)/(Number of Clusters * 4380)
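
To make the fleet calculation concrete, here is a small Python sketch. It takes availability as one minus total downtime over total cluster-hours. The per-cluster downtime values are made up for illustration and are not VAST telemetry, and the function name `fleet_availability_pct` is hypothetical.

```python
def fleet_availability_pct(downtime_hours_per_cluster, period_hours=4380):
    """Fleet availability (%) = 100 * (1 - total downtime / total cluster-hours)."""
    if not downtime_hours_per_cluster:
        raise ValueError("need at least one cluster")
    cluster_hours = len(downtime_hours_per_cluster) * period_hours
    total_downtime = sum(downtime_hours_per_cluster)
    return 100 * (1 - total_downtime / cluster_hours)


# Hypothetical fleet: 150 clusters, two of which had an outage in the window.
downtime = [0.0] * 150
downtime[3] = 2.0    # one 2-hour outage (made-up value)
downtime[42] = 1.25  # one 75-minute outage (made-up value)

print(f"{fleet_availability_pct(downtime):.5f}%")
```

With these invented numbers the fleet lands at about 99.9995%, even though 148 of the 150 clusters had no downtime at all.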

Five nines availability in the field is impressive for any storage system. A couple of factors make it especially so for a vendor like VAST that builds storage systems at large scale, and all the more for a vendor that has only been shipping production systems for three years.

The first is that large scale. The average VAST cluster has over 6 PB of effective capacity, making it bigger than the maximum size of many other all-flash systems. Reliability is just harder at scale: traditional multi-petabyte systems have more devices to break, from SSDs to x86 server nodes, and lots of software moving parts keeping data safe and all those caches up to date.

The other factor making five and a half nines of availability more impressive is that a storage vendor who sells an exabyte of storage in 4 PB clusters will sell 1/20th as many clusters as a vendor selling 200 TB systems. With fewer clusters in the denominator, every moment of downtime has a significant impact on the overall availability percentage.

Simple algebra shows that achieving 99.9995% availability over 150 clusters means that, across all those customers and clusters, there was a total of under four hours of downtime.
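
That algebra is easy to check: total downtime equals the unavailable fraction times the number of clusters times the hours in the window. Here is a quick Python sketch using the figures quoted in this post (150 clusters, 4380 hours, 99.9995%); the function name `fleet_downtime_budget_hours` is ours for illustration.

```python
def fleet_downtime_budget_hours(availability_pct, clusters, period_hours=4380):
    """Total downtime (hours) across the fleet implied by an availability figure."""
    cluster_hours = clusters * period_hours
    return cluster_hours * (1 - availability_pct / 100)


budget = fleet_downtime_budget_hours(99.9995, clusters=150)
print(f"{budget:.1f} hours of downtime across the whole fleet")
```

The result is roughly 3.3 hours, spread across every cluster and every customer in the sample.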

Availability percentages don’t, of course, tell the full tale. When you’re the customer, there’s a big difference between a 10-minute or two-hour outage and that time your storage was down for three days. Five and a half nines of availability may only allow a few total hours of downtime for VAST, but for that vendor with 3000 systems you can’t tell if their 80 hours of downtime are 20 customers each taking a 4-hour outage or a couple of more catastrophic failures.
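
A short Python sketch shows why the percentage alone can’t distinguish those cases. Both outage lists below are invented for illustration; they total the same 80 hours across a hypothetical 3000-system fleet, so they produce the same availability figure while describing very different customer experiences.

```python
def summarize(outage_hours, systems=3000, period_hours=4380):
    """Return (fleet availability %, longest single outage in hours)."""
    total = sum(outage_hours)
    availability = 100 * (1 - total / (systems * period_hours))
    return availability, max(outage_hours)


many_small = [4.0] * 20   # 20 customers, each with a 4-hour outage
few_large = [72.0, 8.0]   # one three-day outage plus one 8-hour outage

for name, outages in (("many small", many_small), ("few large", few_large)):
    availability, worst = summarize(outages)
    print(f"{name}: {availability:.5f}% available, worst outage {worst:.0f} h")
```

The availability column comes out identical for both rows; only the worst-outage column reveals that one customer was down for three days.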

High Availability By Design

The key to VAST’s high availability, like most things at VAST, is the Disaggregated, Shared-Everything (DASE) architecture, which uses stateless protocol servers, sometimes called Cnodes, and highly available storage enclosures, sometimes called Dboxes, where data, the all-important system metadata, and other state are stored. Every Cnode has direct access to all the data and state information in the shared Dboxes, allowing any Cnode to take over for a Cnode that goes offline. The highly available Dbox similarly fails over transparently when any internal component in the data path goes offline.

DASE doesn’t just provide redundancy for every component at every level of the system. DASE also simplifies the job VAST’s software has to do. Since all Cnodes have direct access to the shared SCM SSDs, they can read and write directly to and from the SCM, eliminating the complexity of NVRAM caches, of maintaining cache coherency across multiple nodes, and of coordinating writes across nodes. When a Cnode receives new data, it writes that data to a pair of SCM SSDs directly, a much simpler process.

A highly resilient architecture and simpler software with fewer moving parts to break are just the beginning; the real key to VAST’s availability in the field is our fanatical aversion to downtime. We just hate it. We test every edge case and permutation we can think of, and once a system is up and running we avoid downtime like the plague. We aim to make our systems not just reliable, but 100% available, with no scheduled or unscheduled downtime.

We know that a 100% reliability goal is hard for any storage vendor, and even harder for a vendor whose average system holds multiple petabytes across hundreds of SSDs. VAST’s DASE architecture is the key: sharing the data, metadata, and system state with all the stateless Cnodes makes DASE both simpler and more resilient than conventional shared-media and shared-nothing systems.

We measure how close we come to that goal by the rule of nines. The more 9s the VAST Insight console (pictured above) shows for fleet availability, the fewer, and shorter, the problems our customers have had, and the happier we are.
