Nico Budi Darmawan Tan - Simple Outside, Complicated Inside.: Hadoop Buyer's Guide

Continuation from part 1 >>> Click Here...

... 4 Critical Considerations for RFP ...

1. Performance and Scalability
In its earliest days, Hadoop was primarily used to crawl and index the Web, a workload that was less sensitive to latency than many current use cases. Today, growing numbers of Hadoop projects are being tasked with delivering actionable results in real time, or near real time. Not surprisingly, the definition of Hadoop’s performance has evolved in lockstep: in the earliest days, fast throughput was the primary metric; now, it includes low latency. This recent emphasis on low latency places intense focus on two major attributes of any Hadoop platform:

Its raw performance. This refers to everything from how quickly it ingests information and whether this data is immediately available for analysis, to its latency for real-time applications and MapReduce speeds.
Its ability to scale. This describes how easily it can expand in all relevant dimensions, such as number of nodes, tables, files, and so on. Additionally, this shouldn’t impose heavy administrative burdens, any changes to application logic, or excessive costs.

This section explores a number of capabilities that are directly related to how well your Hadoop implementation will be able to perform — and scale.

#architectural foundations for performance and scalability
For specific features that should be present in your Hadoop environment, have a look at Table 1, which itemizes a number of critical architecture preconditions that can have a positive impact on performance and scalability.

#streaming writes
Given that Hadoop typically is meant to work with massive amounts of information, the job of loading and unloading data must be as efficient as possible. Yet many Hadoop distributions require complex and cumbersome batch or semi-streaming processes using technologies such as Flume and Scribe. To make things worse, these deep-seated inefficiencies are magnified when data volumes are in the gigabytes to terabytes and beyond.

A better technique is for your Hadoop implementation to expose a standard file interface that lets your applications access the Hadoop cluster as if it were traditional Network Attached Storage (NAS). Application servers are then able to write information directly into the Hadoop cluster, as opposed to first staging it on local disks. Data bound for Hadoop can also be automatically compressed on the fly as it arrives, and it’s immediately available for random read and write access by applications through multiple parallel, concurrent connections. These immediate interactions permit the real-time Hadoop-based decision-making described earlier.
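As a minimal sketch of what this looks like from the application side, assume the cluster is mounted over NFS at a path such as /mnt/hadoop. That mount point, the directory layout, and the event fields below are all hypothetical. The application appends events with ordinary file calls, with no staging step and no Hadoop client library:

```python
import json
import os
import tempfile
import time


def append_event(log_path, event):
    """Append one JSON event per line using ordinary file I/O."""
    os.makedirs(os.path.dirname(log_path), exist_ok=True)
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to storage


def read_events(log_path):
    """Read back every event written so far."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# In production this would be the NFS mount of the cluster, for example
# "/mnt/hadoop" (an assumed path). A temporary directory stands in for it
# here so the sketch runs anywhere.
mount_point = tempfile.mkdtemp()
log_path = os.path.join(mount_point, "events", "game.log")

append_event(log_path, {"user": "u1", "action": "login", "ts": time.time()})
append_event(log_path, {"user": "u2", "action": "purchase", "ts": time.time()})
print(len(read_events(log_path)), "events visible to readers")
```

Because the writer and the reader both use plain paths, any analysis job that can see the mount can pick up events as soon as they are written.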

Consider an online gaming company that’s relying on Hadoop to track millions of users and billions of events. There are very short windows of opportunity to introduce virtual goods to players, because these users tend to come and go very quickly. Fortunately, real-time or near real-time analysis on streaming data helps increase revenue by making it possible to offer timely suggestions. Although projects like Apache Drill are meant to facilitate rapid decision-making, this isn’t possible unless the raw data itself arrives in the Hadoop cluster as speedily as possible.

#scalability
IT organizations eager to capitalize on Hadoop are often faced with a conundrum: either acquire more hardware and other resources than will ever be necessary and thus wastefully expend scarce funds and administrator time, or try to squeeze as much as possible from a relatively limited set of computing assets and potentially miss out on fully capitalizing on their Big Data.

A scalable Hadoop platform can help balance these choices and thus make it easier to meet user needs while staying on budget.

Recall from earlier that a given Hadoop instance’s scalability isn’t measured on a single scale. Instead, you should take several factors into consideration:

Files. Hadoop’s default architecture consists of a single NameNode. This constrains Hadoop clusters to a (relatively) paltry 100 million to 150 million files, a number that’s also impacted by the amount of memory available for file metadata. And in small clusters, ceilings on the number of blocks on each data node further constrain the number of available files. Look for a Hadoop platform that avoids the single NameNode bottleneck and has a distributed metadata architecture, and can thus scale to billions, or even trillions, of files and tables.
Number of nodes. Another dimension of scale is the number of physical nodes. Depending on the processing or data storage requirements, your selected Hadoop implementation might need to scale to 1,000 nodes and beyond.
Node capacity/density. In addition, for storage intensive use cases you need to scale through nodes with higher disk densities. This serves to reduce the overall number of nodes required to store a given volume of data.
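The arithmetic behind these limits can be sketched in a few lines. The figure of roughly 150 bytes of NameNode heap per namespace object (file or block) is a commonly quoted rule of thumb rather than a number from this guide, and the disk sizes and 3X replication factor are illustrative assumptions:

```python
import math

# Assumed rule of thumb: each file or block object costs about 150 bytes
# of NameNode heap. Real figures vary by version and configuration.
BYTES_PER_OBJECT = 150


def namenode_heap_gb(num_files, blocks_per_file=1):
    """Rough heap needed to hold metadata for num_files on one NameNode."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT / 1e9


def nodes_needed(data_tb, disks_per_node, tb_per_disk, replication=3):
    """Nodes required to store data_tb of user data at a given disk density."""
    raw_tb = data_tb * replication
    return math.ceil(raw_tb / (disks_per_node * tb_per_disk))


print(namenode_heap_gb(100_000_000))   # heap in GB for 100 million small files
print(nodes_needed(1000, 12, 2))       # 1 PB on nodes with 12 x 2 TB disks
print(nodes_needed(1000, 24, 4))       # same data on denser nodes
```

Under these assumptions, 100 million single-block files already need about 30 GB of heap on one machine, and moving from 24 TB to 96 TB per node cuts the node count for a petabyte from 125 to 32.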

#real-time nosql
More enterprises than ever are relying on NoSQL-based solutions to drive critical business operations. The only way for these new applications to achieve the reliability and adoption of RDBMS-based solutions is for them to conform to the same types of rigorous SLAs that IT expects from applications built on relational databases. For example, NoSQL solutions with wildly fluctuating response times would not be candidate solutions for core business operations that require consistent low latency.

Apache HBase is a key-value based NoSQL database solution that is built on top of Hadoop. It provides storage and real-time analytics for Big Data with the added advantage of MapReduce operations using Hadoop. About 30-40% of Hadoop installs today are estimated to be using HBase. Despite its advantage of integrating with Hadoop, HBase has not reached its true adoption potential because of several limitations in its performance and dependability.

Fortunately, there are a number of innovations that can transform HBase applications to meet the stringent needs for most online applications and analytics. These include:

Reducing the overall number of layers
Eliminating the need for Java garbage collection
Eliminating the need for manually pre-splitting tables
Distributing metadata across the cluster, rather than on a single NameNode
Avoiding compactions and the related I/O storms that these trigger
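To make the pre-splitting item concrete: an administrator who pre-splits a table has to pick region boundaries before any data is loaded, so that writes spread across servers from the start. A sketch of that chore, assuming row keys are uniformly distributed hexadecimal strings (the function name and key width are illustrative, not an HBase API):

```python
def hex_split_points(num_regions, key_width=8):
    """Return evenly spaced split keys that divide a hex keyspace into regions."""
    if num_regions < 2:
        return []  # a single region needs no split points
    keyspace = 16 ** key_width
    step = keyspace // num_regions
    # One boundary between each pair of adjacent regions.
    return [format(i * step, f"0{key_width}x") for i in range(1, num_regions)]


print(hex_split_points(4))
```

If the real keys turn out not to be uniform, these boundaries are wrong and some regions run hot, which is why a platform that splits tables automatically removes a recurring source of guesswork.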

2. Dependability
You can expect Hadoop to be subject to the same dependability expectations as every other type of enterprise software system. You can also anticipate that the same IT administrators who are caring for the rest of your IT assets will also manage your Hadoop implementations.

To reduce the overall burden on users and administrators alike, the most successful Hadoop infrastructure will be capable of coping with the inevitable problems encountered by all production systems. Many of these reactions should be automated to further enhance dependability. This section reviews several traits of Hadoop platforms that have been architected to thrive in the most stressful environments.

#architectural foundations for dependability
Table 2 depicts several foundational principles that help increase the dependability of your Hadoop implementation.

#high availability
High availability (HA) refers to the ability of a Hadoop system to continue to service users even when confronted with the inevitable hardware, network, and other issues that are characteristic of distributed computing environments of this size and complexity.

To deliver the availability that you’ll need for mission-critical production applications, your Hadoop environment should incorporate each of these HA capabilities:

HA is built-in. First and foremost, it shouldn’t be necessary to perform any special steps to take advantage of HA; instead, it should be default behavior for the platform itself.
Metadata. A single NameNode that contains all metadata for the cluster represents a single point of failure and an exposure for HA. A solution that distributes the metadata and couples it with failover has HA advantages and, as an added benefit, there’s no practical limit on the number of files that can be supported.
MapReduce HA. One important aspect of HA is how MapReduce jobs are impacted by failures. A failure in a job or task tracker can impact the ability to meet SLAs. Determine whether MapReduce HA includes automated failover and the ability to continue with no manual restart steps.
NFS HA. This offers high throughput and resilience for NFS-based data ingestion and access.
Recovery time from multiple failures. One of the areas of differentiation across Hadoop distributions is the time and process it takes to recover files in case of a hardware, user or application error, including the ability to recover from multiple failures. How soon are files and tables accessible after a node failure or cluster restart? Seconds? Minutes? Longer?
Rolling upgrades. As Hadoop and its complementary technologies evolve, you should be able to upgrade your implementation without needing to incur any downtime.

#data protection
For growing numbers of organizations, Hadoop is driving crucial business decisions that directly impact the bottom line. This is placing heightened emphasis on safeguarding the data that Hadoop processes. Fortunately, well-proven techniques such as replication and snapshots have long been fundamental building blocks for protecting relational data, and they each have a role to play in shielding Hadoop’s information as well.

Replication. This helps defend Hadoop’s data from the periodic failures you can expect when conducting distributed processing of huge amounts of data on commodity hardware. Your chosen platform should automatically replicate ― at least 3X ― Hadoop’s file chunks, table regions, and metadata, with at least one replica sent to a different rack.
Snapshots. By offering point-in-time recovery without data duplication, Hadoop Snapshots provide additional insurance against user and application errors. If possible, your Hadoop platform’s capabilities should permit snapshots to share the same storage as live information, without affecting performance or scalability. You should also be able to read files and tables directly from a snapshot. Snapshots go beyond mere data protection. For example, data scientists can use a snapshot to aid in the process of creating a new model. Different models can be run against the same snapshot, so that differences in results can be attributed to the model changes alone.
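The rack rule in the replication item can be sketched as follows. This is an illustrative placement function, not the policy of any particular distribution, and the rack and node names are made up:

```python
import random


def place_replicas(nodes_by_rack, replicas=3, rng=random):
    """Pick distinct nodes for the replicas, spanning at least two racks."""
    racks = [rack for rack, nodes in nodes_by_rack.items() if nodes]
    total_nodes = sum(len(nodes) for nodes in nodes_by_rack.values())
    if replicas < 2 or total_nodes < replicas:
        raise ValueError("need at least two replicas and enough nodes")
    if len(racks) < 2:
        raise ValueError("need at least two racks to survive a rack failure")

    # The first two replicas go to two different racks.
    first_rack, second_rack = rng.sample(racks, 2)
    chosen = [
        (first_rack, rng.choice(nodes_by_rack[first_rack])),
        (second_rack, rng.choice(nodes_by_rack[second_rack])),
    ]
    # Any further replicas go to nodes that are not already holding a copy.
    remaining = [
        (rack, node)
        for rack, nodes in nodes_by_rack.items()
        for node in nodes
        if (rack, node) not in chosen
    ]
    chosen += rng.sample(remaining, replicas - 2)
    return chosen


cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5"]}
print(place_replicas(cluster))
```

Whatever nodes are drawn, losing any single rack still leaves at least one copy of the data reachable.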

#disaster recovery
Hadoop is particularly prone to events that have the potential to significantly disrupt business operations, because:

It’s commonly deployed on commoditized hardware
It stores enormous amounts of information
Its data is distributed, and networks are prone to sporadic outages
Today’s IT environments are routinely subject to attack

Since there’s such a good chance that you’ll encounter an emergency, you would be wise to employ mirroring as a preventative measure that can help you recover from even the most dire situations.

Mirroring. Your Hadoop mirroring should be asynchronous and perform auto-compressed, block-level data transfer of differential changes across the WAN. It should mirror data as well as its metadata, while maintaining data locality and data consistency at all times. This ensures that applications can restart immediately upon site failure. It should also have the following characteristics:

3. Manageability
Early in Hadoop’s history, it was fairly common for sophisticated developers with source code-level understanding of Hadoop to manage multiple Hadoop environments. This could work because these developers had detailed knowledge of Hadoop internals and because they had combined developmental and operational responsibilities, as is typical in startups. This clearly won’t translate into mainline IT usage because it simply is not feasible for an operations team to handle many different systems in addition to Hadoop. Total cost of ownership (TCO) is always a major consideration when IT compares solutions, and it’s especially relevant in Hadoop environments.

Seek out a Hadoop platform that supplies comprehensive, intelligently designed tooling that eases administrative burdens. As Hadoop continues to mature, Hadoop distributions will compete on the quality and depth of their management tools in each of these critical areas:

Administration

Volume-based data and user management
Centralized node administration and troubleshooting
Adding and removing disk drives directly through a graphical user interface (GUI)
Rolling upgrades that permit staggered software upgrades over a period of time without disrupting the service
Automated and scheduled administrative tasks
Multi-tenant user access with data and job placement control
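The rolling upgrade item in the list above boils down to a simple loop: take a small batch of nodes out, upgrade them, confirm they are healthy, and only then move on. A minimal sketch, assuming the caller supplies the upgrade and health-check functions (both are hypothetical placeholders, not part of any real management tool):

```python
def rolling_batches(nodes, max_unavailable):
    """Split nodes into batches so at most max_unavailable are down at once."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    return [nodes[i:i + max_unavailable]
            for i in range(0, len(nodes), max_unavailable)]


def rolling_upgrade(nodes, upgrade, is_healthy, max_unavailable=1):
    """Upgrade nodes batch by batch, halting as soon as a batch is unhealthy."""
    done = []
    for batch in rolling_batches(nodes, max_unavailable):
        for node in batch:
            upgrade(node)
        unhealthy = [node for node in batch if not is_healthy(node)]
        if unhealthy:
            # Stop here so the rest of the cluster keeps serving on the old version.
            raise RuntimeError(f"halting upgrade; unhealthy nodes: {unhealthy}")
        done.extend(batch)
    return done


upgraded = rolling_upgrade(
    ["node1", "node2", "node3", "node4", "node5"],
    upgrade=lambda node: None,      # placeholder for the real upgrade step
    is_healthy=lambda node: True,   # placeholder for the real health check
    max_unavailable=2,
)
print(upgraded)
```

Keeping the batch size small relative to the replication factor is what lets the service stay available while nodes restart.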

Monitoring

End-to-end monitoring of the Hadoop cluster, from the application to the hardware level, including detecting disk failures
Alerts, alarms, and heat maps that provide a color-coded, real-time view of the nodes, including their health, memory, CPU, and other metrics
Integration, via a REST API, into different open source and commercial tools as well as the ability to build custom dashboards
Visibility through standard tools like Ganglia and Nagios

Finally, Hadoop implementations routinely scale to hundreds ― or thousands ― of nodes. Attempting to manage the configuration, deployment, and administration of all these nodes is a chore that should be automated as much as possible. Fortunately, leading operating system vendors are continually refining their automated configuration and service orchestration solutions. For example, Juju from Canonical offers both a graphical user interface and a command line interface that lets administrators automate all facets of their distributed processing environments.

4. Data Access
Gobbling up colossal arrays of information is only the beginning of your Hadoop story. To unlock all of your data’s potential value, you need a Hadoop platform that makes it easy to ingest and extract this information quickly and securely, and then lets your developers build fully capable applications using well-proven tools and techniques. It’s even more auspicious if your existing applications can easily connect to Hadoop’s data.

This section is all about making sure that your appointed Hadoop platform will interact smoothly with the rest of your IT environment.

#architecture foundations for data access
Before reading the suggestions for enhancing data access in Hadoop, take a look at the picture below for some fundamentals.

#standard file system interface and semantics (posix)
A POSIX file system that supports random read/write operations on Hadoop as well as providing NFS access opens up Hadoop to much broader usage than is commonly found with the default HDFS. This also simplifies tasks that would otherwise have required much more complex processes.
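The difference random read/write makes can be shown with ordinary file calls. In the sketch below a temporary directory stands in for the mounted cluster, and the fixed-width record layout is an illustrative assumption. The interesting step is the in-place overwrite of a record in the middle of an existing file, which a file system that only permits appends cannot do:

```python
import os
import tempfile

RECORD_SIZE = 16  # assumed fixed record width in bytes


def write_record(path, index, value):
    """Write one fixed-width record at its slot, overwriting what was there."""
    data = value.encode("ascii").ljust(RECORD_SIZE)[:RECORD_SIZE]
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        f.seek(index * RECORD_SIZE)  # jump straight to the record's offset
        f.write(data)


def read_record(path, index):
    """Read one record by offset without scanning the rest of the file."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_SIZE)
        return f.read(RECORD_SIZE).decode("ascii").rstrip()


# A temporary directory stands in for the cluster mount point.
path = os.path.join(tempfile.mkdtemp(), "inventory.dat")
for i, sku in enumerate(["apple", "banana", "cherry"]):
    write_record(path, i, sku)

write_record(path, 1, "blueberry")  # update the middle record in place
print([read_record(path, i) for i in range(3)])
```

Without random writes, the same update would mean rewriting the whole file or layering a separate update log on top of it.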

Administrators and users should be able to mount the cluster over the network like enterprise NAS. File browsers such as Windows Explorer and Mac Finder, IDEs, and standard Linux file interaction commands like ls, grep and tail will thus all be able to work directly with the cluster. With the Hadoop cluster treated like part of the file system, users can drag and drop data into Hadoop or hit the Tab key to autocomplete instructions on its command line interface.

Going beyond new Hadoop-based applications, other solutions ― legacy or new, and written in your choice of programming language ― can use the file system to access and write data on Hadoop. A POSIX file system also makes it straightforward to import and export information to/from relational databases and data warehouses using the standard tools without a need for special connectors.

For example, a retailer could quickly load data in parallel using standard tools through NFS. Data will be streamed in directly, and won’t require creating sequential writes that will slow down the entire process.

#developer tools
Developers have decades of experience employing popular tools and methodologies for interacting with relational databases. While Hadoop introduces new paradigms and concepts, you should seek out platforms that boost developer productivity by:

Offering open source components on public GitHub for download and customization
Making binaries available through Maven repositories for faster application builds
Providing a workflow engine for building applications more quickly
Enabling standard development tools to work directly with data on the cluster
Permitting existing non-Hadoop applications and libraries written in any programming language to be able to access and write data on Hadoop
Supplying SQL-like interactive query capabilities

#security
Scarcely a day goes by without a news headline about a data breach or other major security violation, often involving Big Data. Given the amount of information stored in Hadoop ― and the broad range of this data ― it’s essential that you take proactive steps to protect your data before your organization is featured on the news. Sadly, some Hadoop implementations are so challenging to secure that customers avoid the subject entirely, and never actually enable security.

Rather than referring to a single capability, your Hadoop security should be far-reaching, and encompass each of the following safeguards:
Fine-grained permissions on files, directories, jobs, queues, and administrative operations
Access control lists (ACLs) for tables, columns and column families
Wire-level encryption between the Hadoop cluster and all external cluster access points both natively and through third parties
Standard authentication protocols such as Kerberos, LDAP, Active Directory, NIS, local users and groups, and other third-party authentication and identity systems
Simple yet secure access to the cluster through a “gateway node,” while blocking direct interaction with all other nodes
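To illustrate what access control at the table, column family, and column level means in practice, here is a small sketch. The class, its fields, and the rule that the most specific entry wins are assumptions made for illustration; real platforms define their own ACL models and precedence:

```python
from dataclasses import dataclass, field


@dataclass
class TableAcl:
    """Read permissions at three levels of granularity (hypothetical model)."""
    table: set = field(default_factory=set)      # users allowed on the whole table
    families: dict = field(default_factory=dict)  # family -> allowed users
    columns: dict = field(default_factory=dict)   # (family, column) -> allowed users

    def can_read(self, user, family, column):
        # Assumed precedence: column rule, then family rule, then table rule.
        if (family, column) in self.columns:
            return user in self.columns[(family, column)]
        if family in self.families:
            return user in self.families[family]
        return user in self.table


acl = TableAcl(
    table={"alice", "bob"},
    families={"pii": {"alice"}},
    columns={("pii", "email"): {"alice", "carol"}},
)
print(acl.can_read("bob", "stats", "clicks"))   # falls through to the table rule
print(acl.can_read("bob", "pii", "ssn"))        # blocked by the family rule
print(acl.can_read("carol", "pii", "email"))    # granted by the column rule
```

The point of the finer levels is that a sensitive column family can be locked down without restricting the rest of the table.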

...

Comparing Major Hadoop Distributions

Just about every organization is seeking ways to profit from Big Data, and Hadoop is increasingly serving as the most capable conduit to unlock its inherent value. This means that Hadoop is likely to have a major role to play in your enterprise. Given this probability, you should carefully consider your choice of Hadoop implementation, and pay particular attention to its performance/scalability, dependability, and ease of data access. In particular, make sure that your selection conforms to the way you operate, and not the other way around. The next picture shows a quick comparison chart of some of the differences across the major Hadoop distributions.