Genomics leader PacBio is a pioneer in long-read DNA sequencing, empowering scientists to better understand genomes for research ranging from healthcare to agriculture. However, developing and manufacturing PacBio’s highly accurate sequencers, such as the Revio and Sequel IIe Systems, generates enormous amounts of data, and managing that data is a major challenge.
Adam Knight, Director of IT Infrastructure at PacBio, explained, “For us, it’s the time it takes to sequence genomes, the cost of sequencing, and the error rates and accuracy that are important. We’re trying to increase accuracy, speed, and throughput, while reducing error rates and overall cost of sequencing runs.” Knight added, “We had to do something different. It had to be more than just storage; it had to be a seamless fit into our existing operations.” Knight continued, “Our sequencers produce gigabytes of data per sequencing run for our customers, so you can imagine how much data we need to generate in order to design, develop, and manufacture an instrument platform like Revio. We also have a geographically distributed team working around the world, so we were facing delays due to the logistics of getting data to the right people at the right time.”
With its total data volume growing into the petabyte range, performance and scalability were top concerns. As at many growing companies, PacBio’s infrastructure had sprawled across multiple locations and legacy systems. Knight noted, “We faced challenges with diverse file access protocols and bottlenecks when moving data between our instruments and our network file shares.” Consolidating on a high-performance, easily scalable data platform became a priority.