Data analytics - and the storage systems that enable analytics workloads - have evolved greatly over the last decade. At the start of the 2010s, big data frameworks like Hadoop became popular as data volumes continued to grow. Hadoop was originally designed and implemented as a batch processing framework for semi-structured data. Users initially focused on post-processing of data, with jobs that could take hours or days.
But today's workloads demand a platform that can service not only batch-oriented workloads but also ad hoc, interactive queries and even real-time analytics. Hadoop, and particularly HDFS (the Hadoop Distributed File System, which stores files locally on nodes across the cluster while managing consistency), was not designed for these use cases, so a shift is underway toward more agile, performance-centric systems.
The Hadoop Legacy
Over time Hadoop has waned in popularity, but the underlying concepts it introduced - MapReduce as a computing paradigm; the tight coupling of storage and compute to reduce network traffic; and the use of commodity hardware to build structured and semi-structured analytics platforms - became the core of modern data analytics systems like Databricks and birthed a rich ecosystem of analytics projects like Hive, Kafka, and Impala.
Hadoop was developed at Yahoo in the early 2000s. The initial design drew on a pair of Google papers (“The Google File System” and “MapReduce: Simplified Data Processing on Large Clusters”) and allowed Yahoo to increase processing power while using low-cost, commodity-grade hardware. Hadoop was, and remains, an Apache top-level open source project, allowing any organization or user to download and freely use the software.
Hadoop promised to disrupt the data warehouse and analytics market that had long been a stronghold for companies such as Oracle, Teradata, and IBM. While MapReduce, a data parallelism framework, has faded into the history of technology, HDFS remains prevalent as the file system supporting big data applications like Spark and Kafka.
When HDFS came into being, the aggregate throughput of a server's drives exceeded its network bandwidth, so HDFS was deployed on clusters of commodity servers with direct-attached storage devices, allowing compute to be brought closer to the media. At that time solid state drives (SSDs) were expensive and limited in capacity, so spinning disks (HDDs) were used. Additionally, most networking was still gigabit (1 Gbit/sec), which meant the only way to avoid network bottlenecks was to keep storage and compute tightly coupled. (This concept, known as a shared-nothing storage architecture, was popularized for big data by the 2003 Google File System white paper.) For example, 12 HDDs can deliver approximately 1 GByte/sec, but a gigabit network limits throughput to approximately 100 MByte/sec, making the network the bottleneck.
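To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python; the per-drive throughput figure is an assumed, illustrative value rather than a measured one:

```python
# Back-of-the-envelope comparison of local drive throughput vs. a gigabit NIC.
# All figures are illustrative assumptions, not measurements.
HDD_THROUGHPUT_MB_S = 85        # assumed sequential throughput of one 7200 RPM HDD
DRIVES_PER_NODE = 12            # drives per server, as in the example above
GIGABIT_NIC_MB_S = 1000 / 8     # 1 Gbit/sec in MByte/sec (~125 theoretical, ~100 in practice)

local_drives_mb_s = HDD_THROUGHPUT_MB_S * DRIVES_PER_NODE   # roughly 1 GByte/sec
network_mb_s = GIGABIT_NIC_MB_S

print(f"Local drives:    ~{local_drives_mb_s:.0f} MByte/sec")
print(f"Gigabit network: ~{network_mb_s:.0f} MByte/sec")
print(f"The network can move roughly 1/{local_drives_mb_s / network_mb_s:.0f} "
      f"of what the local drives can deliver")
```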
As big data systems evolved beyond MapReduce into more real-time solutions such as Spark and Kafka, storage needed to get faster to keep pace with these analytics techniques. It also needed to scale to thousands of nodes, providing nearly limitless capacity.
Understanding Lambda Architecture and Big Data Processing
The core of most big data workflows is a pattern called lambda architecture. This pattern depends on events coming in at scale: the data typically originates from sources like log files or IoT sensors and then flows into a streaming platform like Kafka, an open source distributed event streaming platform that can process those streams of data.
You can think of a stream processor as a single query that runs against all the values streaming through. That query acts as a conditional split. For example, if you were capturing sensor data, you might want to flag values that meet a condition so that action could be taken, whether that is executing a function (in the case of a computer system) or having an operator physically examine a device. The rest of the values are streamed into a “cold” layer, allowing for larger-scale analysis of all the values for trending and machine learning services.
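As a hedged sketch of that hot/cold split, the snippet below uses the kafka-python client; the topic names, broker address, and temperature threshold are illustrative assumptions rather than part of any particular deployment:

```python
# A minimal hot/cold conditional split over a Kafka stream (kafka-python client).
# Topic names, broker address, and the threshold are illustrative assumptions.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "sensor-readings",                       # hypothetical source topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)

for message in consumer:
    reading = message.value
    if reading.get("temperature", 0) > 90:
        # Hot path: the condition matched, so act immediately, for example by
        # alerting an operator or triggering an automated remediation function.
        producer.send("sensor-alerts", reading)
    else:
        # Cold path: everything else flows on for larger-scale analysis
        # (trending, machine learning) in the cold layer.
        producer.send("sensor-archive", reading)
```

In production this logic would more likely live in a stream-processing framework such as Kafka Streams or Spark Structured Streaming, but the conditional-split idea is the same.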
That cold layer is commonly a data warehouse or a combination of HDFS and Spark. One thing that hampered Hadoop on physical infrastructure with local storage is that to add storage capacity, you also had to add compute (again due to the tight coupling of CPU and capacity in shared-nothing architectures), which was costly both in hardware and in management overhead. This architecture is similar in concept to hyperconverged infrastructure, where getting more compute or storage capacity means buying both disks and servers.
Two major outcomes arose from Hadoop’s challenges. One is that as organizations turned to object storage as cost-effective storage for big data, the S3 API quickly became an industry standard, enabling organizations to store, retrieve, list, delete, and move data across almost any object store. The other technological breakthrough came more recently in the form of the disaggregated shared everything (DASE) architecture, which allows IT organizations to scale cluster storage independently from compute to better meet application capacity and performance needs.
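To illustrate how little code those core S3 operations require, here is a hedged sketch using the boto3 client; the bucket names, object key, and endpoint URL are assumptions, and the endpoint_url parameter is what lets the same calls target any S3-compatible object store:

```python
# Basic S3 API operations (store, retrieve, list, delete, move) via boto3.
# Bucket names, keys, and the endpoint URL are illustrative assumptions.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objects.example.com")

# Store
s3.put_object(Bucket="analytics-raw", Key="logs/2024/app.log", Body=b"raw log bytes")

# Retrieve
data = s3.get_object(Bucket="analytics-raw", Key="logs/2024/app.log")["Body"].read()

# List
for item in s3.list_objects_v2(Bucket="analytics-raw", Prefix="logs/2024/").get("Contents", []):
    print(item["Key"], item["Size"])

# Move = copy to the destination, then delete the original
s3.copy_object(
    Bucket="analytics-archive",
    Key="logs/2024/app.log",
    CopySource={"Bucket": "analytics-raw", "Key": "logs/2024/app.log"},
)
s3.delete_object(Bucket="analytics-raw", Key="logs/2024/app.log")
```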
The Deep End of the Data Lake
Another development that helped spur modern storage is the data lake concept, in which the data awaiting analysis is stored in its raw format (typically on a distributed file system).
The hardest part of any business intelligence or data analysis system is getting the data from its raw format into a format that can be easily queried. Traditionally, this has been done through a process called Extract, Transform, and Load, or ETL. That process has since shifted toward Extract, Load, and Transform (at analysis), where the transformation takes place at the query layer. This allows different applications to access the raw data and perform only the transformations they need.
Spark is built to work with data lakes and is fast and flexible; it integrates with a variety of data lakes and supports a wide range of languages. You can use Scala, SQL, Java, or Python, as well as Jupyter Notebooks, which enable less advanced users to perform advanced tasks. Spark joins data in memory and can act like a scale-out data warehouse at much lower cost. Its quick adoption and widespread community of users have helped it become one of the most popular data analysis tools of the current era.
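A minimal PySpark sketch of that transform-at-query (ELT) pattern follows; the file paths, column names, and join key are illustrative assumptions:

```python
# ELT with Spark: load raw files from the data lake as-is, then transform at
# query time with an in-memory join. Paths and schemas are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-at-query").getOrCreate()

# Extract + Load: read the raw, untransformed data straight from the lake.
events = spark.read.json("s3a://analytics-raw/events/")     # raw sensor events
devices = spark.read.json("s3a://analytics-raw/devices/")   # raw device metadata

events.createOrReplaceTempView("events")
devices.createOrReplaceTempView("devices")

# Transform (at analysis): each application applies only the shaping it needs,
# expressed here as a SQL join and aggregation that Spark executes in memory.
daily_temps = spark.sql("""
    SELECT d.site,
           to_date(e.event_time) AS day,
           avg(e.temperature)    AS avg_temperature
    FROM events e
    JOIN devices d ON e.device_id = d.device_id
    GROUP BY d.site, to_date(e.event_time)
""")

daily_temps.show()
```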
However, beyond the frameworks that enable large-scale data analysis, there have also been substantial changes in the underlying storage that enable scale, speed, and volume for data analytics. When Hadoop was built, it was designed around very dense, very cheap 7200 RPM mechanical hard drives. Since then, solid state storage has become much cheaper, denser, and faster, with protocols like NVMe delivering extremely low latency. That speed, combined with falling SSD costs, means highly scalable and affordable all-flash storage is now available to support your data lake.
Things Will Never Be the Same
As you can see, storage has evolved a great deal over the years. It has gotten faster, gained a much greater ability to scale, and been decoupled from compute and the underlying hardware. How to protect, manage, and pull meaningful insights out of the data residing in that storage is the next great challenge. VAST Data can help.
The disaggregated shared everything (DASE) architecture of the VAST Data Platform allows customers to benefit from real-time storage performance for all data analytics use cases, ranging from data warehouse analytics and ad-hoc queries to complex data science jobs.
VAST Data’s goal is to help companies get more out of their storage by unlocking greater operational efficiency and faster performance to implement advanced analytics. If you’d like to learn more, schedule a demo or reach out to hello@vastdata.com.