While I’m not a big fan of industry predictions, they are a bit like death and taxes in the storage industry… can’t avoid them. In the process of working up our calls for 2021, I found myself thinking about the diversity of use cases VAST Data has encountered this year and how a surprising number of them called for capabilities well beyond what users have been accustomed to getting from NFS. Since I’m a fairly busy bee, I don’t always have the chance to sit back and see the big picture, but our end-of-year thinking has slapped me in the face with an absolute truth that I didn’t fully appreciate whilst jumping from deal-to-deal throughout the year… here goes:
Legacy (TCP, unipath) NAS will not survive the jump to 4th generation PCIe
Now, there will of course be use cases that are exceptions to this rule, such as home directories that serve desktops, but any file system client machine that’s looking to make use of all of its processing cores and/or drive efficient use of accelerators (GPUs, FPGAs, etc.) is now forced to abandon traditional NAS options as PCIe Gen4 settles into the market.
The Core of the Issue
To start this off, let’s set the foundation of this discussion with a definition of ‘legacy’ and a look at how this problem emerges from the way NFS was originally architected.
- NFS has historically been TCP-based. While TCP and UDP are both options for NFS packet delivery, TCP is the transport of choice because of its stricter ordering and delivery semantics. Protocol overhead comes from the processing of destination and source IPs, order sequencing, checksums and more. These operations are performed on a client’s CPU and consume significant CPU cycles especially for smaller packet sizes (MTU=1500). While RDMA has emerged to offload much of the protocol overhead and keep I/O processing out of the client’s kernel, VAST is the only enterprise solution for NFS over RDMA… NetApp, Dell EMC Isilon and Pure’s Flashblade are limited to TCP-based access.
- NFS client mounts have historically been limited to a single connection, mounted from a single client core and connecting to storage via a single server endpoint. The original NFS standard was, in essence, point-to-point: clients have historically mounted remote storage using a single client CPU core, which means IP protocol handling is also confined to that one core.
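To make the single-core constraint concrete, here’s a back-of-the-envelope sketch. The 2 GB/s and 3.2 GHz figures are assumptions drawn from the discussion below, and the cycle budget ignores interrupt and copy overhead:

```shell
# Rough per-packet CPU budget for single-connection, TCP-based NFS.
# Assumed: ~2 GB/s of traffic, 1500-byte MTU, one 3.2 GHz core doing all protocol work.
BYTES_PER_SEC=$((2 * 1000 * 1000 * 1000))
MTU=1500
CYCLES_PER_SEC=$((3200 * 1000 * 1000))
PKTS_PER_SEC=$((BYTES_PER_SEC / MTU))              # ~1.3M packets per second
CYCLES_PER_PKT=$((CYCLES_PER_SEC / PKTS_PER_SEC))  # cycles available per packet
echo "packets/s: ${PKTS_PER_SEC}, cycle budget per packet: ${CYCLES_PER_PKT}"
```

With only a couple of thousand cycles per packet to cover checksums, sequencing and copies, it’s easy to see why one core saturates long before a fast NIC does.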
Enter CPUs and Moore’s Law. Squeezing out nanometers is hard, and clock frequency gains are even harder to achieve. As increasing core frequency has hit a wall, CPU manufacturers introduced multi-core designs to encourage application parallelism. Intel introduced its first dual-core processor in 2005. The Intel® Pentium® Processor Extreme Edition 840 ran at 3.2 GHz… and 15 years later, their fastest Xeon® 2nd Generation Scalable Processors top out at 3.8GHz. 15 years, and only ~19% more compute resources for TCP overhead handling.
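The arithmetic behind that last claim, using the figures from the text (shell is used here just to show the math):

```shell
# 2005: Pentium EE 840 at 3.2 GHz; 2020: fastest Xeon 2nd Gen Scalable at 3.8 GHz.
GAIN=$(awk 'BEGIN { printf "%.2f", (3800 - 3200) / 3200 * 100 }')
echo "single-core frequency gain over 15 years: ${GAIN}%"   # ~19% when rounded
```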
The knee in the evolutionary performance curve of single NFS mountpoint performance can be seen very clearly in this overview of how processors have evolved over the past 5 decades (note the plateau in the Frequency (MHz) curve):
Performance-wise, a single 3+GHz core can drive about 2GB/s (16Gbps) of NFS traffic per client. This end of Moore’s Law for single-core performance improvements has resulted in a very awkward waltz between NFS and CPUs over the years.
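Those per-core numbers imply a simple sizing rule. A sketch, assuming ~16 Gbps (≈2 GB/s) of NFS traffic per core as above; real-world throughput will vary:

```shell
# Ceiling division: connections (cores) needed to fill a NIC at ~16 Gbps per connection.
PER_CORE_GBPS=16
CONNS_100=$(( (100 + PER_CORE_GBPS - 1) / PER_CORE_GBPS ))
CONNS_200=$(( (200 + PER_CORE_GBPS - 1) / PER_CORE_GBPS ))
echo "100Gb NIC: ~${CONNS_100} connections; 200Gb NIC: ~${CONNS_200} connections"
```

A single connection, by contrast, leaves roughly four-fifths of a 100Gb NIC idle.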
- Early on: customers wrestled with the IP overhead of NFS for compute-bound applications… I/O took too much system resource. In those days, RDMA was the only recourse to free the CPU.
- In the 10Gb era: a single core could handle IP packets and happily feed a few other application processors. That core could saturate a single-port or dual-port 10Gb NIC… which was fine for most applications.
- Now: the single-core nature of NFS and the limits of IP-based I/O from that core have starved all other cores. A single core can’t drive more than 1/5th of a 100Gb NIC.
Admittedly, the above is mostly theoretical, so let’s put single-core NFS over IP and NFS over RDMA performance into perspective, when comparing to modern client processors.
Set against modern processors, a few things become clear:
- Single-connection NFS over RDMA is 5x faster than single-connection TCP-based NFS on a 100Gb link, but neither is enough to saturate a file system client
- There’s a clear distinction between PCIe Gen3 and PCIe Gen4 based systems… and once Intel ships Gen4-based server processors, the gap between what a processor needs and what legacy NFS can deliver will only widen
- CPUs are only part of the challenge, and as other accelerators sit behind CPUs and require all (or more) of a CPU’s bandwidth just to be fed with data, systems must be able to side-step the memory bandwidth limits of modern processors and feed GPUs directly
PCIe Gen4… the reckoning for single-core and IP-based NFS
We’ve been waiting for the broad availability of PCIe Gen4 hardware for years, and with the imminent release of Intel’s first Gen4-based server architectures, we’ll finally be at the point where a single lane of PCIe Gen4 can carry as much NFS/IP traffic as a single core can support. The evolution of PCIe performance continues to outpace gains in core frequency and CPU hardware assists. CPUs have bumped into the limits of the speed of light – and legacy NFS is the victim. At the high end, AMD architectures have 128 lanes of Gen4 PCIe per socket… establishing an I/O imbalance of two orders of magnitude… and the problem will only double when PCIe Gen5 systems start shipping in 2022.
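The imbalance is easy to quantify. A sketch, assuming ~2 GB/s of usable bandwidth per Gen4 lane per direction (the PCIe 4.0 per-lane figure) and the ~2 GB/s single-mount NFS ceiling discussed earlier:

```shell
# Aggregate PCIe Gen4 bandwidth per AMD socket vs. a single legacy NFS mount.
LANE_GBS=2            # ~2 GB/s per PCIe Gen4 lane, each direction
LANES=128             # Gen4 lanes per AMD socket, per the text
NFS_GBS=2             # ~2 GB/s from one single-connection NFS mount
SOCKET_GBS=$((LANE_GBS * LANES))
RATIO=$((SOCKET_GBS / NFS_GBS))
echo "socket PCIe bandwidth: ${SOCKET_GBS} GB/s; imbalance: ${RATIO}x"
```

A 128x gap is the “two orders of magnitude” imbalance, and it doubles again with Gen5.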
|Generation|Raw Bit Rate|Link BW|BW/Lane/Way|Total BW x16|
|---|---|---|---|---|
|PCIe 3.0|8.0 GT/s|8 Gb/s|~1 GB/s|~32 GB/s|
|PCIe 4.0|16 GT/s|16 Gb/s|~2 GB/s|~64 GB/s|
|PCIe 5.0|32 GT/s|32 Gb/s|~4 GB/s|~128 GB/s|
NFS is Dead, Long Live NFS Express
Despite its performance shortcomings, NFS is great for a simple reason: it’s a battle-tested standard, customers love standards, and most people don’t want to entangle their operating system agenda with their storage agenda. Parallel file system drivers that must be installed and maintained on host machines have always added a layer of complexity, one that rears its head most often during host or storage upgrade events, since there are always firmware interdependencies between the host OS and the file system driver.
Fortunately, there is hope. In the past 10 years, the market has up-leveled the NFS client to support the rising requirements of the multi-core era. While much of the old-guard NAS market hasn’t evolved with NFS, VAST Data has.
The recipe for standards + performance is surprisingly simple:
- RDMA All The Way – CPU Bypass: By supporting the NFS over RDMA standard that’s been in the Linux kernel for nearly 10 years, VAST has taken the baton that Oracle first started running with in 2014, hardening NFS over RDMA and making sure that everything we do is contributed back to the community. RDMA offloads all of the I/O operations onto the Ethernet or InfiniBand NIC and keeps packet processing out of the Linux kernel… freeing cores from the overhead of network I/O. Today, we have customers who run RDMA on both InfiniBand (IB) and Ethernet (RoCEv2) networks and see up to 90% of line-rate utilization. RDMA is also essential groundwork for advanced forms of communication with NVIDIA GPUs (see the third point below).
- Multiple Connections per Mountpoint, to Parallelize I/O Across Client Cores: By supporting and adapting the new nconnect capability that has recently made its way into the upstream kernel, and therefore into modern enterprise Linux distributions (e.g., RHEL), customers can now open up all of their cores, lanes and NICs to drive I/O into and out of a client to their hearts’ content. A picture is worth 1,000 words, so we’ll use the NVIDIA DGX-A100 as an example of how this all connects:
In the diagram above, there are 8 RDMA connections going into a DGX-A100 across eight ports, where each connection is managed by its own CPU core (of which there are many in this 128-core machine). From a user perspective, this new nconnect multi-connection option is simple to mount:
mount -o proto=rdma,nconnect=8,localports=10.0.0.1-10.0.0.8,remoteports=10.0.0.9-10.0.0.16 10.0.0.3:/ /mnt/export
- And Finally, Bypass Host Memory To Serve I/O At PCIe Speeds: For an example of how CPUs and even CPU memory can bottleneck specific applications, look no further than the NVIDIA DGX A100 system. As shown in the comparison of peak performance above, 8 x A100 GPUs are organized in a single host with 2 x AMD Rome processors as the x86 resource in the system. Organizations like NVIDIA are keen to express the full capabilities of their AI accelerators, and NVIDIA in particular has developed a new method of RDMA’ing an NFS packet from a file server directly into GPU memory… bypassing the CPU and CPU memory altogether, eliminating a data copy within DGX-class systems and resulting in spectacular levels of remote storage bandwidth. This fall, we’ve been working extensively with NVIDIA’s GPUDirect Storage (GDS) team to implement this host memory bypass for NFS and to approach the upper limits of the 8 x 200Gb NICs you’ll find in a DGX A100. The results speak for themselves:
Universal Storage DGX A100 Benchmarking
RDMA + Multiple Connections + GPUDirect Storage = 81,000% more performance than legacy NAS. NVIDIA prototyped all of the GDS work on their own when we first started working together, and no server-side modifications were needed to drive very fast performance even in the earliest days. That’s the benefit of having a standards-based, linearly-scalable architecture as we’ve engineered at VAST Data.
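For readers who want to confirm these transports on their own clients, the options negotiated by an RDMA/nconnect mount can be inspected with standard Linux tools. A sketch (tool availability and output details vary by distribution and kernel):

```shell
# Inspect a live NFS mount's negotiated options (run on the client, after mounting):
#   nfsstat -m                    # lists each NFS mount with its flags (proto=rdma, ...)
#   grep nfs /proc/mounts         # mount options as the kernel sees them
#   cat /proc/self/mountstats     # with nconnect > 1, expect one "xprt:" line per connection
```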
As mentioned at the beginning… this story is bigger than just AI, although it must be said that deep learning training may not run effectively at all without these enhancements. We see the need for significant client bandwidth beyond AI in the media space, in life science, in big data analytics, in HPC… even in enterprise backup environments. These applications have always had a good appetite for data.
The applications haven’t changed, but the underlying computing infrastructure has, and legacy NFS has not evolved in a manner consistent with how processing architectures and applications have evolved. It will be interesting to see if and how the scale-up and scale-out toaster companies of yesterday re-work their architectures. Fortunately, this new “NFS Express” is faster than ever and couples with VAST Data’s Universal Storage to make it simple for customers to marry their evolving compute agenda with storage that’s fast, scalable and standards-based (so, easy to operate). This is one of the many reasons why IDC calls VAST Data’s new concept “the storage architecture of the future”.
To learn more about how to unleash your file applications, just reach out to chat.