Achieve high bandwidth performance with NFS enhancements from VAST Data

What was old is new again! The Network File System protocol is the grizzled remote data access veteran harking back to 1984 and has been a tried and tested way to access data from another server, while preserving the hierarchical file and directory semantics native to Unix and Linux. Being a Layer 7 protocol in the OSI stack, modern implementations of NFS use TCP (and RDMA) as the underlying transport across the network.

In this article, we’ll examine some of the innovative work VAST Data has delivered to our customers for NFS3. VAST also supports access over SMB and S3, as well as through a Kubernetes CSI driver, but these will not be discussed here. In addition, we recently introduced NFS4 support at VAST. A more detailed discussion of our implementation of NFS4 features can be found in this blog post.

NFS3 is well defined in the standards community in RFC 1813, and while its availability is almost universal across operating systems and networks, it has never enjoyed a solid reputation for high-bandwidth performance.

Enter VAST Data

We figured throwing out the baby with the bath water made no sense. Why not leverage the simplicity, solid install base, and standards-based implementation of NFS, and improve its performance and availability to get the best of both worlds? The complexities of parallel file systems simply aren’t worth the hassle when NFS can be equally performant. This article examines the different methods of accessing data from a VAST Cluster using NFS3 that help you deliver high-bandwidth performance, which can be game-changing, especially in GPU-accelerated environments.

VAST Data supports 4 different modes of NFS, and the same client can use any combination of these at the same time for different mount points. They differ in the underlying transport protocol (TCP vs RDMA) and the introduction of new features in the upstream Linux kernel for NFS around multiple TCP connections between the client and storage system.

NFS/TCP

This is the existing standards-based implementation present in all Linux kernels. It sets up one TCP connection between one client port and one storage port, and uses TCP as the transport for both metadata and data operations.

# This is an example mount command (1 local port to 1 remote port):
sudo mount -o proto=tcp,vers=3 172.25.1.1:/ /mnt/tcp
# This is standards-based syntax for NFS/TCP - proto=tcp is the default.

While this is the easiest option to use and requires no installation of non-standard components, it is also the least performant. Typically we see about 2-2.5 GB/s per mount point with large block IOs and about 40-60K 4K IOPS. Performance in this case is limited for two reasons: all traffic is sent to a single storage port on a single VAST C-node, and a single TCP socket is used, which runs up against TCP limitations.
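
One way to observe the single-connection topology described above is to count established TCP connections from the client to the storage port. The sketch below is a minimal, hypothetical helper (`count_conns` is not part of any VAST or Linux tool). It assumes input in the column layout printed by `ss -tn`, and the port number passed in is an assumption that should match your mount.

```shell
#!/bin/sh
# Hypothetical helper: count established TCP connections to one peer.
# Reads `ss -tn` style output on stdin, where column 1 is the state
# and column 5 is the peer address in ADDRESS:PORT form.
count_conns() {
    awk -v peer="$1:$2" '$1 == "ESTAB" && $5 == peer { n++ } END { print n + 0 }'
}

# Example usage on a client (2049 is the standard NFS port; adjust if
# your mount uses a different one):
#   ss -tn | count_conns 172.25.1.1 2049
```

With a plain NFS/TCP mount you would expect this to report a single connection to the storage address.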

NFS/RDMA

This too has been a capability in most modern Linux kernels for many years. Here the connection topology is still the same – single connection between one client port and one storage port – however the data transfer occurs using RDMA thus increasing the throughput. The use of RDMA bypasses the TCP socket limitations. This implementation requires:

  • an RDMA-capable NIC e.g. Mellanox ConnectX series
  • Jumbo frame support in the RDMA network
  • an OFED from Mellanox and
  • an installable package (.rpm or .deb) for NFS/RDMA

Technically, some versions of Linux include an NFS/RDMA package, but we strongly recommend using the VAST version, as it fixes several issues with the stock package.

From a kernel support perspective, VAST supports several OS variants (CentOS, Ubuntu, SUSE, …), Linux kernels (3.x, 4.x, 5.x) and MOFEDs (4.4 onwards), and can provide a build for any specific kernel/MOFED combination that is needed. Please refer to the table at the end of this post for more information.

# This is an example mount command for NFS/RDMA (1 local port to 1 remote port):
sudo mount -o proto=rdma,port=20049,vers=3 172.25.1.1:/ /mnt/rdma
# This is standards-based syntax for NFS/RDMA. Port 20049 is also standard for NFS/RDMA and is implemented in VAST.
With NFS/RDMA, we are able to achieve 8-8.5 GB/s per mount point with large IOs (1 MB), while IOPS remains unchanged relative to the standard NFS over TCP option.
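
To confirm which transport a mount actually negotiated, you can inspect its option string. The following is a minimal sketch: `mount_opt` is a hypothetical helper name, and the option string in the example is illustrative rather than captured from a real system.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one option from a
# comma-separated mount option string (the fourth field of /proc/mounts).
mount_opt() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n "s/^$2=//p"
}

# Illustrative option string, not taken from a real system.
opts="rw,relatime,vers=3,proto=rdma,port=20049"
echo "transport: $(mount_opt "$opts" proto)"
echo "port: $(mount_opt "$opts" port)"
```

On a Linux client, the real option string for a mount point can be read with `awk '$2 == "/mnt/rdma" { print $4 }' /proc/mounts` and passed to the helper.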

NFS/TCP with nconnect:

This is an upstream kernel feature, available only in Linux kernels 5.3 and later. It requires no specialized hardware (no RDMA, no special NICs, …) and works “out-of-the-box”. Here the NFS driver allows multiple connections between the client and one storage port, controlled by the nconnect mount parameter. Up to 16 TCP connections can be created between the client and the storage port with nconnect. The transport protocol is TCP, as is the case with standard NFS/TCP. Using nconnect bypasses the single TCP connection limitations.

# This is an example mount command for kernel 5.3+ nconnect NFS/TCP (1 local port to 1 remote port):
sudo mount -o proto=tcp,vers=3,nconnect=8 172.25.1.1:/ /mnt/nconnect
# This is standards-based syntax for NFS/TCP. Note that nconnect is limited to 16.
# Once again, proto=tcp is the default - the command will simply not work on kernels without nconnect support.
# The default port is specified as 20048 for nconnect and is implicit - use the "-v" flag for details if curious

The upstream kernel nconnect feature can provide close to line-rate bandwidth for a single 100 Gb NIC. IOPS remain unchanged because only one storage port is in use, as with the previous options. For example, we can achieve 11 GB/s on a single mount point with this on a Mellanox ConnectX-5 100 Gb NIC.
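
Because stock nconnect depends on the kernel version, a provisioning script may want to check the running kernel before adding the option. The sketch below assumes the 5.3 threshold stated in this article. `supports_nconnect` is a hypothetical helper, and it only compares the major and minor numbers of the release string, so it cannot detect distribution backports.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel release string is 5.3 or newer.
supports_nconnect() {
    release="$1"
    major="${release%%.*}"       # text before the first dot
    rest="${release#*.}"         # text after the first dot
    minor="${rest%%[!0-9]*}"     # leading digits of the remainder
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "${minor:-0}" -ge 3 ]; }
}

if supports_nconnect "$(uname -r)"; then
    echo "kernel $(uname -r): stock nconnect should be available"
else
    echo "kernel $(uname -r): stock nconnect is not expected"
fi
```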

Multipath NFS/RDMA or NFS/TCP:

This option takes the features of NFS nconnect from the 5.3+ Linux kernels, combines it with TCP or RDMA as the transport, and enhances the connection capabilities to the storage.

First, nconnect here provides the ability to have multiple connections between the client and the storage. However, the connections are no longer limited to a single storage port, but can connect with any number of storage ports that can serve that NFS filesystem. Load balancing and HA capabilities are built into this feature as well, and for complex systems with multiple client NICs and PCIe switches, NIC affinity is implemented to ensure optimal connectivity inside the server.

A key differentiator with Multipath is that the nconnect feature is no longer restricted to the 5.3+ kernels, but has been backported to older kernels (3.x and 4.x) as well, making these powerful features available to a broad ecosystem of deployments. Typical mount semantics differ from normal NFS mounts in a few ways. See the example below.

This is an example mount command (4 local ports to 8 remote ports):
```
sudo mount -o proto=rdma,port=20049,vers=3,nconnect=8,localports=172.25.1.101-172.25.1.104,remoteports=172.25.1.1-172.25.1.8 172.25.1.1:/ /mnt/multipath
```
The VAST NFS client adds the following mount parameters:
localports    A list of IPv4 addresses for the local ports to bind
remoteports   A list of IPv4 addresses for the remote ports to bind
IP addresses can be given as an inclusive range, with `-` as a delimiter, e.g.
`FIRST-LAST`. Multiple ranges or IP addresses can be separated by `~`.
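
To make the range syntax above concrete, the sketch below expands a `localports` or `remoteports` value into one address per line. `expand_ports` is a hypothetical helper, not part of the VAST client, and it assumes both ends of a range share their first three octets, as in the example mount command.

```shell
#!/bin/sh
# Hypothetical helper: expand a localports/remoteports value into one
# IPv4 address per line. Items are separated by "~"; an item is either
# a single address or an inclusive FIRST-LAST range.
# Assumption: both ends of a range differ only in the last octet.
expand_ports() {
    printf '%s\n' "$1" | tr '~' '\n' | while IFS= read -r item; do
        case "$item" in
            *-*)
                first="${item%-*}"; last="${item#*-}"
                prefix="${first%.*}"
                i="${first##*.}"; end="${last##*.}"
                while [ "$i" -le "$end" ]; do
                    echo "$prefix.$i"
                    i=$((i + 1))
                done
                ;;
            *) echo "$item" ;;
        esac
    done
}

expand_ports "172.25.1.101-172.25.1.104"
```

Applied to the example mount, the `localports` value expands to four addresses and the `remoteports` value to eight, matching the "4 local ports to 8 remote ports" description.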

The performance we are able to achieve for a single mount far exceeds any other approach. We have seen up to 162 GiB/s (174 GB/s) on systems with 8×200 Gb NICs with GPU Direct Storage, with a single client DGX-A100 System.
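
The two units quoted above differ by the usual binary versus decimal factor (1 GiB = 2^30 bytes, 1 GB = 10^9 bytes). The short sketch below reproduces that conversion and, for context, the raw line rate of a set of NICs before any protocol overhead. The helper names are hypothetical.

```shell
#!/bin/sh
# Convert GiB/s to GB/s (2^30 bytes vs 10^9 bytes).
gib_to_gb() {
    awk -v g="$1" 'BEGIN { printf "%.1f\n", g * 2^30 / 1e9 }'
}

# Raw line rate in GB/s for N NICs of a given speed in Gb/s (8 bits per byte).
line_rate_gb() {
    awk -v n="$1" -v gbit="$2" 'BEGIN { printf "%.1f\n", n * gbit / 8 }'
}

gib_to_gb 162        # prints 173.9, which rounds to the 174 GB/s quoted above
line_rate_gb 8 200   # prints 200.0
line_rate_gb 1 100   # prints 12.5
```

By this arithmetic, the quoted single-mount figure is roughly 87% of the raw 200 GB/s that eight 200 Gb NICs can carry.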

Additionally, as all the C-nodes participate in delivering IOPS, an entry-level 4 C-node system has been shown to deliver 240K 4K IOPS to a single client with a single mount point and a single 100 Gb NIC. The system is designed to scale this performance linearly as more C-nodes participate.

Conclusion

These connectivity options are powerful methods to access data over a standards-based protocol, NFS. The modern variants, and the innovation that VAST Data has brought to the forefront, have changed the landscape of what NFS is capable of. The table below summarizes the current status of these mount options and their relative performance.

| NFS connection method | Kernel compatibility | (M)OFED requirements | Single-mount BW (READ) | Single-mount IOPS (READ) |
|---|---|---|---|---|
| Standard NFS/TCP | All (2.6+) | None | 2-2.5 GB/sec (1 client 100 Gb/s NIC) | 40-60K 4K Read |
| Standard NFS/RDMA | On request | Most MOFEDs from Mellanox | 8-8.5 GB/sec (1 client 100 Gb/s NIC) | 40-60K 4K Read |
| NFS/TCP + stock nconnect | 5.3 and up | None | 10-11 GB/sec (1 client 100 Gb/s NIC) | 40-60K 4K Read |
| Multipath NFS/RDMA or NFS/TCP (NFS/RDMA also supports GDS and NIC Affinity) | Requires VAST NFS Client. Fedora and Debian forks; 3.10, 4.15, 4.18, 5.3 and 5.4 kernels. Others on request. | Most MOFEDs from Mellanox for x86. Multipath NFS/TCP can be supported without a (M)OFED as well. | Up to 162 GiB/sec (174 GB/s) with 8×200 Gb/s IB CX-6 client NICs, using one DGX-A100 server and GPU Direct for Storage (needs RDMA) | Scales linearly with the number of CNodes in the VIP pool. 4 CNodes give 200K-240K 4K IOPS |

All these approaches are available with VAST Data’s Universal Storage. Some are standard, some have kernel limitations, and some need RDMA support with some (free) software from VAST Data. VAST Data is working with NFS upstream maintainers to contribute our code to the Linux kernel (a first tranche has been submitted for the 5.14 kernel) and open source the work – an effort that we hope will converge in the coming year.

In the meantime, read this jointly produced reference architecture from VAST Data and NVIDIA to learn how your organization can implement a turnkey petabyte-scale AI infrastructure solution that is designed to significantly increase storage performance for your GPU accelerated environments.
