Similarity Reduction: Report From the Field

Author

Howard Marks

The aim of the VAST Data Platform is to break many of the storage trade-offs that customers have wrestled with for decades, imposed by the legacy systems they’ve been deploying. The one VAST is most focused on is the trade-off between the low cost of high-capacity archive storage and the performance of expensive flash, which leads too many organizations toward the tyranny of tiered storage.

For many applications, data reduction is one of the keys to bringing the cost of all-flash storage more in line with what customers have historically spent for disk. All-flash storage also helps with data reduction because deep pattern matching requires random access media. To close the cost gap, VAST has pioneered a new form of data reduction that combines the global approach of deduplication with the fine granularity of local compression.

It seems counterintuitive, but since reading reduced data creates many random I/Os, flash’s random access properties are the enabling capability that makes it possible to bring the cost of an all-silicon storage system down to HDD economics. We are so confident in our new approach to similarity reduction that we guarantee that it will reduce data better than any other storage system.

With the release of VAST version 3, we now have more than a year of experience with data reduction across our customers, so I thought this a good time for a brief review of our customers’ reactions to VAST reduction.

The Tale of the Tape

The most important measure of any data reduction system is its effectiveness. Our customers’ systems report their Data Reduction Ratio (DRR) to the VAST Insight ET phone home analytics platform. Those results include:

Backup Target: 3:1 (66% savings) on pre-deduplicated and compressed backups
Genomics: 2:1 data reduction (50% savings)
Splunk: 3.5:1 reduction of compressed index and log files (71% savings)
Seismic Data: 2.5:1 (60% savings)
Animation/VFX files: 3:1 (66% savings)
Quantitative Trading Market Data: 8:1 (88% savings)
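For readers who want to check the percentages, a reduction ratio of N:1 corresponds to a capacity saving of 1 - 1/N. The short Python sketch below applies that formula to the ratios listed above; the function and variable names are illustrative and not part of any VAST tool, and the results agree with the quoted figures to within whole-percent rounding.

```python
def savings_percent(ratio: float) -> float:
    """Convert an N:1 data reduction ratio into percent of capacity saved."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (1.0 - 1.0 / ratio) * 100.0


# Ratios quoted in the list above (workload name -> N in N:1).
reported = {
    "Backup Target": 3.0,
    "Genomics": 2.0,
    "Splunk": 3.5,
    "Seismic Data": 2.5,
    "Animation/VFX files": 3.0,
    "Quantitative Trading Market Data": 8.0,
}

for workload, ratio in reported.items():
    print(f"{workload}: {ratio}:1 -> {savings_percent(ratio):.1f}% saved")
```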

Remember, these reduction ratios are for live data. Since VAST data reduction is always on, all of our performance specs and benchmarking results are generated with reduction enabled. We believe that data reduction can provide meaningful efficiency gains as we work to end the HDD era.

Performance matters. When we presented the performance results of one customer’s POC (Proof of Concept), they asked us to repeat the testing with our data reduction features turned on. They were pretty shocked when we told them not only had we run their application with data reduction enabled but that we had no OFF switch for data reduction; we reduce all the data all the time. Apparently, according to this customer, it was standard operating procedure for storage vendors to turn off data reduction when measuring performance but to include the savings from data reduction when pricing a system. We don’t have data reduction to satisfy RFP checkmarks; it’s a core part of the VAST Data Platform philosophy.

Before VAST

Before VAST, even though several all-flash block arrays used compression and deduplication, large-scale (multi-petabyte) unstructured data stores didn’t reduce data very well. Some systems compressed data but didn’t deduplicate. Others relegated deduplication to scheduled post-process jobs that created so much cluster activity that most users disabled reduction.

This mediocre-at-best data reduction is understandable when you remember that scale-out NAS systems were originally architected roughly 20 years ago, solidly in the hard drive era, when both CPU cycles and IOPS were a lot more expensive than they are today. Since disk space was comparatively cheap, the architects at Isilon and the others decided reduction wasn’t at the top of their priority list.

Figure: DRAM cost limits the scale of a deduplication realm, because the dedupe map is replicated across controllers/nodes.

Deduplication in particular has always presented storage system architects with a clear trade-off between the greater data reduction that deduplicating on small blocks (say 4KB) delivers and the lower CPU, memory and IOPS demands of deduplicating on large blocks (say 128KB). The greater reduction of small block deduplication is clear: when just a few bytes of the data change, the system has to store only a small block rather than a much larger one (32 times as large in the 4KB versus 128KB example).
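To make the block-size trade-off concrete, here is a minimal Python sketch of fixed-block deduplication. It is a toy model, not how any particular product works: blocks are fingerprinted with SHA-256, the data is random test data, and the helper name new_bytes_stored is hypothetical. It measures how many bytes must be newly stored after an 8-byte edit to a 1MB object at each block size.

```python
import hashlib
import random


def new_bytes_stored(original: bytes, edited: bytes, block_size: int) -> int:
    """Bytes a fixed-block deduplicator must newly store for `edited`,
    assuming every block of `original` is already in the store."""
    known = {
        hashlib.sha256(original[i:i + block_size]).digest()
        for i in range(0, len(original), block_size)
    }
    stored = 0
    for i in range(0, len(edited), block_size):
        block = edited[i:i + block_size]
        digest = hashlib.sha256(block).digest()
        if digest not in known:
            stored += len(block)  # whole block is stored, however small the change
            known.add(digest)
    return stored


SIZE = 1024 * 1024  # 1 MiB of deterministic pseudo-random test data
original = random.Random(0).getrandbits(8 * SIZE).to_bytes(SIZE, "little")

# Flip 8 consecutive bytes in the middle of the object.
buf = bytearray(original)
for offset in range(500_000, 500_008):
    buf[offset] ^= 0xFF
edited = bytes(buf)

for block_size in (4 * 1024, 128 * 1024):
    cost = new_bytes_stored(original, edited, block_size)
    print(f"{block_size // 1024}KB blocks: {cost} new bytes stored")
```

In this toy case the 8-byte edit lands inside a single block at either size, so the cost of the change is exactly one block: 4KB in one case and 128KB in the other.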

The IOPS amplification of deduplication is less obvious. As data is stored on a deduplicating storage system, it’s broken into blocks that are stored pretty randomly across the storage system’s media. When the storage system receives a 1MB read request, it has to rehydrate the original data from those randomly stored blocks. The result is that a single read request gets amplified into many random reads across the system’s drives. For a system using 4KB dedupe blocks, that 1MB read expands to 256 I/Os, or thousands of IOPS from just a few video streams.
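The arithmetic behind that amplification is simple enough to sketch in Python. The worst case assumed here is that every dedupe block of a request lives at a different location on the media. The stream count and request rate in the example are hypothetical numbers chosen only to illustrate the "few video streams" case; they are not measurements.

```python
def reads_per_request(request_bytes: int, dedupe_block_bytes: int) -> int:
    """Worst-case random reads needed to rehydrate one request,
    assuming every dedupe block sits at a different location."""
    return -(-request_bytes // dedupe_block_bytes)  # ceiling division


def backend_iops(streams: int, requests_per_second: int,
                 request_bytes: int, dedupe_block_bytes: int) -> int:
    """Random IOPS generated on the media by a set of sequential readers."""
    per_request = reads_per_request(request_bytes, dedupe_block_bytes)
    return streams * requests_per_second * per_request


MB = 1024 * 1024
KB = 1024

print(reads_per_request(1 * MB, 4 * KB))    # small dedupe blocks
print(reads_per_request(1 * MB, 128 * KB))  # large dedupe blocks

# Hypothetical workload: 4 streams, each issuing four 1MB reads per second.
print(backend_iops(streams=4, requests_per_second=4,
                   request_bytes=1 * MB, dedupe_block_bytes=4 * KB))
```

Under those assumed numbers, a mere 16 application reads per second turn into roughly four thousand random reads at the media, which is why the block size choice matters so much on spinning disks.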

Deduplication and spinning disks just don’t get along except for the coldest of data. If data is well deduplicated, it will generate many IOPS to read, and disk performance will plummet. VAST’s all-flash design provides enough IOPS to deliver both performance and data reduction, breaking yet another trade-off.

While many scale-out NAS vendors now also offer all-flash versions of their systems, their architectures have deep hard disk roots. They’ve optimized their data placement for hard drives, minimizing head motions, because with disks space is cheap but head motions are expensive. Vendors can’t abandon their disk roots and rethink their data layout without discontinuing the hybrid and HDD models in their line, hobbling the performance of those systems, or forking their product line and wasting their engineering spend on parallel and competing efforts.

Conventional deduplication also has scaling issues. Since most deduplication systems keep their hash tables in DRAM for fast lookups, the size of a controller’s DRAM limits how much data it can deduplicate. As a result, even the largest storage systems that feature effective deduplication only hold about 1PB of data.
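A back-of-the-envelope Python sketch shows why an in-DRAM hash table becomes the limit. The 32 bytes per entry used below is an assumption for illustration (a fingerprint plus a block location), not a figure from any specific product, and the calculation takes the worst case in which every block is unique.

```python
def dedupe_index_bytes(capacity_bytes: int, block_bytes: int,
                       entry_bytes: int = 32) -> int:
    """Worst-case size of a dedupe hash table: one entry per unique block.

    entry_bytes=32 is an assumed size for a fingerprint plus a location.
    """
    entries = capacity_bytes // block_bytes
    return entries * entry_bytes


PIB = 2 ** 50
TIB = 2 ** 40
GIB = 2 ** 30
KIB = 2 ** 10

small = dedupe_index_bytes(1 * PIB, 4 * KIB)
large = dedupe_index_bytes(1 * PIB, 128 * KIB)
print(f"1 PiB at 4KB blocks:   {small / TIB:.0f} TiB of index")
print(f"1 PiB at 128KB blocks: {large / GIB:.0f} GiB of index")
```

Under these assumptions, fine-grained deduplication of a single petabyte already needs terabytes of index, far more than a controller typically carries in DRAM, and the problem grows if that map has to be replicated across controllers.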

The VAST Solution: Similarity and DASE

VAST’s DASE scales data reduction much further. VAST’s DASE architecture breaks the trade-off between data deduplication and scale by storing all the hash tables and other reduction metadata in the large pool of Storage Class Memory across the VAST enclosures instead of controller DRAM. Since each enclosure adds more Storage Class Memory, and therefore more hash space, along with its QLC flash, a VAST cluster can scale to exabytes as a single data reduction realm.

Figure: VAST stores data reduction metadata in Storage Class Memory, which scales with enclosures/capacity and is shared by all VAST Servers.

VAST’s Similarity reduces data where others fail to. Our Similarity Reduction, described in detail in this blog post, reduces data in similar blocks down to minimum-sized delta blocks by compressing similar blocks together. The result is a best-of-all-worlds method that not only compresses data with a state-of-the-art algorithm and saves the extra space taken by duplicate copies of blocks, as deduplication does, but also reduces the space taken when two blocks are only similar, containing common data smaller than the block size.
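The general idea of compressing one block against a similar one can be illustrated with Python's standard zlib module, which accepts a preset dictionary. To be clear, this is not VAST's algorithm, and the function names are hypothetical; it is only a small demonstration that a block differing from a reference block by a few bytes shrinks to a tiny delta when the reference is available, while hash-based deduplication would see two entirely different blocks.

```python
import random
import zlib


def compress_alone(block: bytes) -> bytes:
    """Ordinary local compression with no knowledge of other blocks."""
    return zlib.compress(block, 9)


def compress_against(block: bytes, reference: bytes) -> bytes:
    """Compress `block` using a similar `reference` block as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(block) + comp.flush()


def restore(delta: bytes, reference: bytes) -> bytes:
    """Rebuild the original block from its delta and the same reference."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(delta) + decomp.flush()


BLOCK = 16 * 1024
# Random bytes stand in for data that does not compress on its own.
reference = random.Random(0).getrandbits(8 * BLOCK).to_bytes(BLOCK, "little")

# A "similar" block: identical except for three changed bytes.
buf = bytearray(reference)
for offset in (100, 8_000, 16_000):
    buf[offset] ^= 0xFF
similar = bytes(buf)

alone = compress_alone(similar)
delta = compress_against(similar, reference)
print(f"compressed alone: {len(alone)} bytes")
print(f"delta vs. reference: {len(delta)} bytes")
```

The two blocks have different hashes, so exact-match deduplication saves nothing, and the data is incompressible on its own, so local compression saves nothing either. Only compressing against the similar block recovers the shared content.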

VAST not only delivers sub-millisecond access times; we deliver that all-flash performance while reducing bigger data sets than anyone else, better than anyone else.

The boffins in the VASTcave are constantly improving Similarity Reduction, squeezing our customers’ data into smaller and smaller amounts of flash. Stay tuned to this channel, where I’m sure I’ll be reporting on how VAST’s DASE architecture, with data reduction metadata in Storage Class Memory, is powering our customers to reduce more data into less space.
