The vast majority of today’s scale-out storage systems are based on the shared-nothing architecture, where each node has exclusive ownership of its drives, spinning and/or solid-state. Vendors have built shared-nothing systems to fill almost every niche in storage. SolidFire’s all-flash block arrays and HPC-oriented parallel file systems like BeeGFS deliver high performance from shared-nothing clusters, while object stores like Ceph, Swift, and Scality, along with Hadoop’s HDFS for analytics data, target capacity more than performance. Nutanix and Scale Computing run storage on each hypervisor host alongside user VMs, creating hyperconverged infrastructure (HCI), yet another shared-nothing use case.
In many ways, today’s shared-nothing storage is a continuation of the software-defined storage movement of the early 2000s. Once Xeons were powerful enough to provide storage services in software without RAID coprocessors and custom ASICs, x86 servers took over as storage system controllers, even when vendors such as Nimble and Tintri delivered them as appliances.
So it was only natural that when storage developers turned their attention to the problem of scale, they would use x86 servers as their building blocks. They could then write software to aggregate the storage of the many drives across a networked cluster of servers into a storage pool and abstract that storage as a file system, object store, or LUNs.
Shared-nothing nodes come in almost as wide a variety of package options as the x86 servers they’re based on. Vendors offer shared-nothing nodes as standard 1U and 2U pizza boxes, high-density server chassis, and blade formats like their compute servers. Just as with servers, each form factor has its advantages and disadvantages.
While shared-nothing nodes typically have redundant power supplies and network connections, server motherboards have multiple single points of failure, so shared-nothing systems have to protect their data against node failures as well as drive failures. To do that, they replicate or erasure-code data across multiple nodes; should a drive or node fail, the system rebuilds from the surviving replicas or erasure-code strips. Because this essentially extends the technologies RAID used across disk drives, including single and double parity as erasure codes, to cluster nodes, one early shared-nothing vendor dubbed the architecture RAIN, for Redundant Array of Independent Nodes.
Shared-nothing systems shard and protect data across the nodes of a cluster in many ways and at varying granularity. For example, some object stores simply replicate each object across two or three nodes. Other simplistic systems literally mirror drive 3 on node A to drive 3 on node B, while more sophisticated systems use distributed Reed-Solomon coding of logical blocks.
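To make the sharding idea concrete, here is a minimal sketch, not any vendor’s actual code, of the simplest erasure-code scheme mentioned above: splitting a stripe of data into strips across data nodes plus a single XOR parity strip, the same single-parity math RAID applied to disks, extended to nodes. Strip size and node count are illustrative.

```python
STRIP = 4  # bytes per strip; real systems use tens or hundreds of KB


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def encode_stripe(data, n_data):
    """Split one stripe into n_data strips plus an XOR parity strip."""
    strips = [data[i * STRIP:(i + 1) * STRIP] for i in range(n_data)]
    parity = strips[0]
    for s in strips[1:]:
        parity = xor_bytes(parity, s)
    return strips, parity


def rebuild(strips, parity, lost):
    """Recover the strip held by a failed node from the survivors + parity."""
    acc = parity
    for i, s in enumerate(strips):
        if i != lost:
            acc = xor_bytes(acc, s)
    return acc


strips, parity = encode_stripe(b"ABCDEFGHIJKL", n_data=3)
assert rebuild(strips, parity, lost=1) == strips[1]  # "node 1" failed
```

Single parity survives one node or drive loss per stripe; the Reed-Solomon codes used by more sophisticated systems generalize this to survive multiple simultaneous failures.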
The Shared-Nothing Latency Conundrum
Regardless of the details, replicating or erasure-coding data across multiple nodes adds at least one network round-trip transit time to write latency, as shown in the diagram below.
When users or compute servers access a shared-nothing cluster, each is only connected to one cluster node. When a user writes data, the node that the user is connected to has to protect that data by replicating or erasure-coding the data to one or more other nodes before acknowledging the write.
Since the node servicing a write has to wait for acknowledgment from all the other nodes involved in that write, the latency of any given write is determined by the slowest node. Thus, if one node is busy for any reason, say the users connected to it are also writing data, that node may take a millisecond or two longer to respond, with a corresponding increase in write latency.
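The key point, that write latency tracks the slowest acknowledgment rather than the average, can be shown in a few lines. The node names and millisecond figures below are hypothetical:

```python
# A write is acknowledged only after every node involved responds,
# so the write's latency is the max, not the mean, of the per-node
# response times. One busy node drags the whole write down with it.
node_ack_ms = {"node-a": 0.2, "node-b": 0.3, "node-c": 2.1}  # node-c is busy

mean_ms = sum(node_ack_ms.values()) / len(node_ack_ms)
write_latency_ms = max(node_ack_ms.values())

assert write_latency_ms == 2.1       # user sees the slowest node
assert write_latency_ms > mean_ms    # far worse than the "typical" node
```

Note that the more nodes a write touches, the higher the odds that at least one of them is momentarily busy, which is why this tail-latency effect worsens as clusters and stripe widths grow.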
Shared-nothing systems can mitigate this indeterminacy with a replicated write cache. Instead of erasure-coding data as it’s written, they store new writes in NVRAM, replicating the newly written data to just one other node’s NVRAM cache before acknowledging the write. They then erasure-code the data when it’s migrated to persistent drives, be they spinning or solid-state. The problem with write caches, even beyond the cost of NVDIMMs, is that maintaining a coherent cache across multiple nodes is complex and creates a lot of east-west network traffic between the nodes, which can impact performance, especially while running parallel applications.
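The write path described above can be sketched as follows. This is a simplified illustration under stated assumptions, not any vendor’s implementation: `Node`, `write`, and `destage` are hypothetical names, and the "erasure-coded" persistent store is reduced to a dictionary.

```python
class Node:
    """One cluster node with an NVRAM write cache and persistent drives."""
    def __init__(self, name):
        self.name = name
        self.nvram = {}   # write-back cache: key -> data
        self.drives = {}  # persistent store (erasure coding elided)


def write(primary, mirror, key, data):
    # New data lands in local NVRAM plus ONE mirror node's NVRAM,
    # so the write is acknowledged after a single round trip instead
    # of waiting on a full erasure-code stripe's worth of nodes.
    primary.nvram[key] = data
    mirror.nvram[key] = data
    return "ack"


def destage(primary, mirror, key):
    # Later, the cached data is migrated to persistent drives; at this
    # point a real system would erasure-code it across many nodes.
    primary.drives[key] = primary.nvram.pop(key)
    mirror.nvram.pop(key)


a, b = Node("a"), Node("b")
assert write(a, b, "blk-7", b"payload") == "ack"
destage(a, b, "blk-7")
assert "blk-7" not in a.nvram and a.drives["blk-7"] == b"payload"
```

What this toy version leaves out is exactly the hard part the paragraph above names: keeping the two NVRAM copies coherent under concurrent reads, overwrites, and node failures, which is where the complexity and east-west traffic come from.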
Shared-nothing systems using erasure codes take that network latency hit on reads as well. Imagine a system that shards files into 128 KB strips across its nodes. When a user issues a 1 MB read, the receiving node has to retrieve eight 128 KB strips and reassemble them into a 1 MB reply. If any one of those nodes is busy, the user has to wait.
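Using the numbers from that example, the arithmetic works out like this (the per-strip fetch times are hypothetical):

```python
# A 1 MB read sharded into 128 KB strips needs 1024 / 128 = 8 strips,
# each fetched from a different node. The reply can't be assembled
# until the slowest strip arrives, so one busy node delays the read.
STRIP_KB = 128
strips_needed = 1024 // STRIP_KB
assert strips_needed == 8

strip_fetch_ms = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 1.5]  # one busy node
read_latency_ms = max(strip_fetch_ms)
assert read_latency_ms == 1.5  # 15x the fast nodes' response time
```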
This gets even worse in object storage systems using dispersal codes to spread the strips of erasure-code stripes across multiple clusters or data centers. Those dispersal codes can recover data from any 12 of the 18 strips in an erasure-code stripe. But because dispersal codes don’t use separate data and parity strips, as Reed-Solomon and VAST’s Locally Decodable Codes do, those systems need 12 of the 18 strips to decode the data even for small reads. When six strips are stored in each of three data centers, WAN latency has to be included in read latency.
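A quick calculation shows why the WAN hop is unavoidable in that 12-of-18 layout. The data center names and latency figures below are illustrative assumptions:

```python
# With any-12-of-18 dispersal coding and 6 strips per data center,
# even a read served entirely from the local site still needs
# 12 - 6 = 6 strips from remote sites, putting WAN latency on
# every read path, not just the failure path.
strips_per_dc = {"dc-east": 6, "dc-west": 6, "dc-south": 6}
k = 12  # any 12 of the 18 strips decode the data

local_strips = strips_per_dc["dc-east"]       # reader's local site
remote_needed = k - local_strips
assert remote_needed == 6                     # must cross the WAN

lan_ms, wan_ms = 0.2, 30.0                    # assumed round-trip times
read_latency_ms = wan_ms if remote_needed > 0 else lan_ms
assert read_latency_ms == 30.0
```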
Moving Beyond Shared-Nothing Systems
As its ubiquity shows, the shared-nothing architecture was a good solution to the 2000s’ and 2010s’ problem of scale. Users could build petabyte-scale storage systems from the industry-standard building blocks of the day, but those systems were never the set-it-and-forget-it Redundant Arrays of Independent Nodes that shared-nothing advocates promised. They worked only when we made sure to check on them before bedtime and tuck them in.
In future installments, we’ll look at how real-world shared-nothing systems fall short of that promised RAIN nirvana, exactly what “industry standard hardware of the day” was, and how the evolution of storage and networking hardware has undermined the assumptions shared-nothing is built on. In the meantime, if you need help choosing the right storage for your modern workloads, download our File System Comparison white paper.