Distributed Processing Methodologies
In the past, organizations that wanted to work with large information sets would have needed to:
- Acquire very powerful servers, each sporting very fast processors and lots of memory
- Stage massive amounts of high-end, often-proprietary storage
- License an expensive operating system, a RDBMS, business intelligence, and other software
- Hire highly skilled consultants to make all of this work
- Budget lots of time and money
- Commodity hardware
- Distributed file systems
- Open source operating systems, databases, and other infrastructure
- Significantly cheaper storage
- Widespread adoption of interoperable Application Programming Interfaces (APIs)
In a nutshell, these distributed processing methodologies are constructed on the proven foundation of ‘Divide and Conquer’: it’s much faster to break a massive task into smaller chunks and process them in parallel. There’s a long history of this style of computing, dating all the way back to functional programming paradigms like LISP in the 1960s.
Given how much information it must manage, Google has long been heavily reliant on these tactics. In 2004, Google published a white paper that described their thinking on parallel processing of large quantities of data, which they labeled “MapReduce”. The white paper was conceptual in that it didn’t spell out the implementation technologies per se. Google summed up MapReduce as follows:
“MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.”
MapReduce was proven to be one of the most effective techniques for conducting batch-based analytics on the gargantuan amounts of raw data generated by web search and crawling before organizations expanded their use of MapReduce to additional scenarios.
Rather than referring to a single tactic, MapReduce is actually a collection of complementary processes and strategies that begins by pairing commoditized hardware and software with specialized underlying file systems. Computational tasks are then directly performed on the data wherever it happens to reside, rather than the previous practices of first copying and aggregating raw data into a single repository before processing it. These older practices simply won’t scale when the amount of data expands beyond terabytes. Instead, MapReduce’s innovative thinking means that rather than laboriously moving huge volumes of raw data across a network, only code is sent over the network.
MapReduce was, and continues to be, a superb strategy for the problem that it was originally designed to solve: how to conduct batch analysis on the massive quantities of data generated by users running searches and visiting web sites. The concepts behind MapReduce have also served as the inspiration for an ever-expanding collection of novel parallel processing computational frameworks aimed at a variety of use cases, such as streaming analysis, interactive querying, integrating SQL with machine learning, and so on. While not all of these new approaches will achieve the same level of traction as the popular and still-growing batch-oriented MapReduce, many are being used to solve interesting challenges and drive new applications.
Conveniently, each of these methodologies shields software developers from the thorny challenges of distributed, parallel processing. But as Robert D. Schneider described earlier, Google’s MapReduce paper didn’t dictate exactly what technologies should be used to implement its architecture. This means that unless you worked for Google, it’s unlikely that you had the time, money, or people to design, develop, and maintain your own, site-specific set of all of the necessary components for systems of this sophistication. After all, it’s doubtful that you built your own proprietary operating system, relational database management system, or Web server.
Thus, there was a need for a complete, standardized, end-to-end solution suitable for enterprises seeking to apply the full assortment of modern, distributed processing techniques to help extract value from reams of Big Data. This is where Hadoop comes in.
Around the same time that Google was publishing the MapReduce paper, two engineers - Doug Cutting and Mike Cafarella - were busily working on their own web crawling technology named Nutch. After reading Google’s research, they quickly adjusted their efforts and set out to create the foundations of what would later be known as Hadoop. Eventually, Cutting joined Yahoo! where the Hadoop technology was expanded further. As Hadoop grew in sophistication, Yahoo! extended its usage into additional internal applications. In early 2008, the Apache Software Foundation (ASF) promoted Hadoop into a top-level open source project.
Simply stated, Hadoop is a comprehensive software platform that executes distributed data processing techniques. It’s implemented in several distinct, specialized modules:
- Storage, principally employing the Hadoop File System (HDFS) although other more robust alternatives are available as well
- Resource management and scheduling for computational tasks
- Distributed processing programming model based on MapReduce
- Common utilities and software libraries necessary for the entire Hadoop platform
Hadoop has broad applicability across all industries.
Enterprises have responded enthusiastically to Hadoop. Table 3 below illustrates just a few examples of how Hadoop is being used in production today.
Selecting your Hadoop infrastructure is a vital IT decision that will affect the entire organization for years to come, in ways that you can’t visualize now. This is particularly true since we’re only at the dawn of Big Data in the enterprise. Hadoop is no longer an “esoteric”, lab-oriented technology; instead, it’s becoming mainline, it’s continually evolving, and it must be integrated into your enterprise. Selecting a Hadoop implementation requires the same level of attention and devotion as your organization expends when choosing other critical core technologies, such as application servers, storage, and databases. You can expect your Hadoop environment to be subject to the same requirements as the rest of your IT asset portfolio, including:
- Service Level Agreements (SLAs)
- Data protection
- Integration with other applications
Checklist: Ten Things to Look for When Evaluating Hadoop Technology
1. Look for solutions that support open source and ecosystem components that support Hadoop API’s. It’s wise to make sure API’s are open to avoid lock-in.
2. Interoperate with existing applications. One way to magnify the potential of your Big Data efforts is to enable your full portfolio of enterprise applications to work with all of the information you’re storing in Hadoop.
3. Examine the ease of migrating data into and out of Hadoop. By mounting your Hadoop cluster as an NFS volume, applications can load data directly into Hadoop and then gain real-time access to Hadoop’s results. This approach also increases usability by supporting multiple concurrent random access readers and writers.
4. Use the same hardware for OLTP and analytics. It’s rare for an organization to maintain duplicate hardware and storage environments for different tasks. This requires a high-performance, low-latency solution that doesn’t get bogged down with time-consuming tasks such as garbage collection or compactions. Reducing the overhead of the disk footprint and related I/O tasks helps speed things up and increases the likelihood of efficient execution of different types of processes on the same servers.
5. Focus on scalability. In its early days, Hadoop was primarily used for offline analysis. Although this was an important responsibility, instant responses weren’t generally viewed as essential. Since Hadoop is now driving many more types of use cases, today’s Hadoop workloads are highly variable. This means that your platform must be capable of gracefully and transparently allocating additional resources on an as-needed basis without imposing excessive administrative and operational burdens.
6. Ability to provide real-time insights on newly loaded data. Hadoop’s original use case was to crawl and index the Web. But today – when properly implemented – Hadoop can deliver instantaneous understanding of live data, but only if fresh information is immediately available for analysis.
7. A completely integrated solution. Your database architects, operations staff, and developers should focus on their primary tasks, instead of trying to install, configure, and maintain all of the components in the Hadoop ecosystem.
8. Safeguard data via multiple techniques. Your Hadoop platform should facilitate duplicating both data and metadata across multiple servers using practices such as replication and mirroring. In the event of an outage on a particular node you should be able to immediately recover data from where it has been replicated in the cluster. This not only fosters business continuity, it also presents the option of offering read-only access to information that’s been replicated to other nodes. Snapshots - which should be available for both files and tables - provide point-in-time recovery capabilities in the event of a user or application error.
9. Offer high availability. Hadoop is now a critical enterprise technology infrastructure. Like other enterprise-wide fundamental software assets, it should be possible to upgrade your Hadoop environment without shutting it down. Furthermore, your core Hadoop system should be isolated from user tasks so that runaway jobs can’t degrade or even bring down your entire cluster.
10. Complete administrative tooling and comprehensive security. It should be easy for your operational staff to maintain your Hadoop landscape, with minimal amounts of manual procedures. Self-tuning is an excellent way that a given Hadoop environment can reduce administrative overhead, and it should also be easy for you to incorporate your existing security infrastructure into Hadoop.