GATK Reference Architectures with VAST Data

2020 was a tough year for us all, with the global life sciences community scrambling to tame the ravages of COVID-19 and curb its onslaught on the world's population. At VAST Data, we wasted no time in doing everything we could to help these researchers, and immediately introduced a program to provide our All-Flash storage technology at no cost to research programs around the world: anything to accelerate time to scientific insight and curb the pandemic. Many world-class research institutes took advantage of this program to advance their work.

VAST Data was no johnny-come-lately to Genomics and Life Sciences research when COVID-19 hit. Some of the world’s best minds at the National Institutes of Health (NIH), Ginkgo Bioworks and Harvard Medical School, to name a few, had already been using VAST Data technologies for their research clusters well before that. Genomics research has exploded over the past decade, moving into an era where a whole human genome can be sequenced, and primary and secondary analysis completed, in a matter of hours. The same task took more than 10 years for the first human genome, in an effort that began in 1990.

But we wanted to do more than just supply our technology. The key was to accelerate science, and we wanted to attack this problem on two fronts: accelerating deployment of the entire solution, and reducing the time needed for sequence analysis. We turned to our friends at Intel to collaborate on building a world-class analysis stack for Genomics research.

The Broad Institute is a collaboration between Harvard and MIT focused on Genomics research, and is one of the premier research institutions of its kind in the world. The Broad Institute and Intel had developed a reference architecture for this purpose: an implementation of the Broad-Intel Genomics Stack, or BIGstack, based on the Genome Analysis Toolkit (GATK), a very popular workbench for bioinformatics research around the world.

Built on Intel processors in an HPC configuration, the solution combines several software components from the industry and the Broad Institute, such as SLURM, Munge, Cromwell, GATK, BWA, SAMtools and Picard. These were assembled in a tried and tested configuration that gives researchers a well-understood and repeatable experience. Traditionally such work was done using a parallel file system for storage, as high-throughput NGS analysis is very demanding on storage performance, but parallel file systems bring complexity and very high operational barriers.

Enter VAST Data, with exceptional parallel file system performance and scale, but with NFS simplicity and costs approaching hard drive economics. Intel had introduced their Intel Select Solution for Genomics program, and the first step was to become certified as part of that architecture. To do so, we built a system compliant with the Reference Architecture, which had to pass a performance threshold and was then certified by Intel.

VAST Data built this reference architecture using a 4-node cluster with Intel Xeon Platinum 8260L processors, 512 GB memory, and local Intel P4800X Storage Class Memory drives, together with the smallest of our VAST Data Clusters, and ran the reference workloads as specified by the Intel Select Solution program. We easily surpassed the threshold set for acceptance into the program, comparing well with their previous best results using a parallel file system.

At the peak, with 64 parallel workflows running in Cromwell for the baseline workload, the VAST Cluster delivered an aggregate of 19 GB/s to the 4-node SLURM cluster, contributing significantly to the excellent results obtained. The Storage Class Memory on the nodes was used only for temp space, with the data and code residing entirely on the VAST Cluster.
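To put the peak figure in perspective, the aggregate can be divided across the nodes and workflows. This is only a back-of-the-envelope sketch: it assumes the load was spread evenly, which the benchmark description does not state, and the helper name `even_share` is made up for illustration.

```python
# Rough split of the reported aggregate throughput.
# Assumption (not from the benchmark report): I/O is spread evenly
# across all nodes and all concurrently running workflows.

AGGREGATE_GB_PER_S = 19.0   # reported aggregate delivered by the VAST Cluster
NODES = 4                   # size of the SLURM cluster
PARALLEL_WORKFLOWS = 64     # concurrent Cromwell workflows at the peak


def even_share(total: float, parts: int) -> float:
    """Return the share each part receives if the total is split evenly."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    return total / parts


per_node = even_share(AGGREGATE_GB_PER_S, NODES)
per_workflow = even_share(AGGREGATE_GB_PER_S, PARALLEL_WORKFLOWS)
workflows_per_node = PARALLEL_WORKFLOWS // NODES

print(f"{per_node:.2f} GB/s per node")
print(f"{per_workflow:.2f} GB/s per workflow")
print(f"{workflows_per_node} workflows per node")
```

Real pipelines are bursty, so individual stages will see more or less than these averages at any given moment.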

VAST Data is proud to now be part of the Intel Select Solution for Genomics program and to use that certification logo to give our mutual customers assurance of the high quality of our joint efforts.

The architecture for the solution is shown below, with a few variants. We used SLURM as the job scheduler, and VAST used NFS/RDMA as the connectivity protocol over RoCEv2 Ethernet fabrics, as opposed to Omni-Path. Mellanox ConnectX-5 NICs were used for RDMA connectivity to the VAST Cluster. These variants are completely transparent to the end-user and researcher, who get the full power of the GATK-based Genomics toolkit with the solution stack they are used to.

While the reference benchmarks gave both Intel and VAST the assurance that we had a highly performant, rock-solid architecture for Genomics analysis in the GATK framework, the real test is of course to run real-life workloads on it. Researchers often use such machinery to perform Whole Genome Sequencing (WGS) or Whole Exome Sequencing (WES) analysis. Hence we decided to do more detailed testing of WGS workloads to examine how the solution scaled with parallel processing. The WGS analysis pipeline, reproduced from the Intel Select Solution for Genomics Workflow Definition Language (WDL) definition for Cromwell, is shown below.

Starting with data from the sequencers, the idea was to see how a realistic data analysis pipeline that generates VCF files (for tertiary analysis) would perform. The input data is the well-known and widely studied germline NA12878 human genome, with the Homo sapiens assembly HG38 reference genome. The data were obtained from the Broad Institute's freely available archives on the Google Cloud Platform.

The results speak for themselves. A single WGS workflow as described above executed in about 14 hours of wall-clock time on the 4-node Xeon Platinum SLURM cluster. However, when we scaled up to test throughput, we were able to execute 20 WGS workflows in parallel in about 29 hours, or about 4 WGS workflows/node/day. Reaching higher, we achieved 40 WGS workflows in 49 hours, or about 5 WGS workflows/node/day.

For a modest cluster of this size, at about 5 WGS workflows/node/day, this works out to an effective rate of less than 5 hours per node for a complete WGS secondary analysis, significantly shortening the time before researchers can begin tertiary analysis on the variants identified through the process.
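The throughput figures above follow from simple arithmetic on the reported run counts and wall-clock times. The sketch below reproduces them; the function names are hypothetical, and the run times are the approximate values quoted in the text.

```python
# Reproduce the throughput arithmetic from the reported WGS runs.
# Inputs are the approximate figures quoted in the text; the function
# names are illustrative only.

NODES = 4


def workflows_per_node_per_day(workflows: int, wall_hours: float, nodes: int) -> float:
    """Completed workflows per node, normalised to a 24-hour day."""
    return (workflows / nodes) * (24.0 / wall_hours)


def effective_hours_per_workflow(workflows: int, wall_hours: float, nodes: int) -> float:
    """Node-hours consumed per completed workflow (the inverse rate)."""
    return wall_hours * nodes / workflows


# (parallel workflows, approximate wall-clock hours)
runs = [(20, 29.0), (40, 49.0)]

for workflows, hours in runs:
    rate = workflows_per_node_per_day(workflows, hours, NODES)
    per_wf = effective_hours_per_workflow(workflows, hours, NODES)
    print(f"{workflows} workflows in {hours:.0f} h: "
          f"{rate:.2f} workflows/node/day, {per_wf:.1f} node-hours each")
```

Note that the effective node-hours per workflow is a throughput measure. A single workflow run on its own still took about 14 hours of wall-clock time, so the gain comes from running many workflows concurrently.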

VAST also investigated running GATK WGS workflows on GPU-based systems using the GATK-based Clara Parabricks solution from NVIDIA. Preliminary testing on a DGX-2 system with 16 V100 GPUs (only 8 were used for WGS workloads) showed that we can run a WGS analysis pipeline in as little as 40 minutes per genome, and two WGS workflows in under an hour. The data used were again the germline NA12878 genome and the HG38 reference genome.

The adoption of VAST storage technology in the Genomics arena, and Life Sciences in general, has been amazing to watch over the past year. The reason is simple – VAST tackles the four principal issues for storage in this space with no compromise – performance, scale, cost and complexity. Our customers no longer have to pick the ones they most care about, and live with sub-par properties in the rest.

Simply put – VAST Data Solutions deliver:

  • Uncompromising performance, up to hundreds of GB/s of IO throughput
  • Linear scale from Petabytes to Exabytes of data
  • Ground-breaking Data Reduction algorithms and Data Protection schemes with cost approaching hard drive economics
  • Simple deployment, scaling and management – the simplicity of NFS and an Always Online experience

Join us in our journey to help better the planet’s health at VAST Data for Life Sciences!