Hadoop is a massive job parallelizer. This means certain jobs (like MapReduce) can be executed in a massively parallel way without requiring significant per-job configuration of a cluster. It also has the added benefit of executing jobs with built-in redundancy to reduce job fails based on individual node failures.
It can handle massive amounts of data (150 PB in some of the largest Hadoop clusters). The engine is optimized around massive data sets and has a distributed filesystem that is designed for large files and distribution of data around a cluster to minimize network traffic during job execution.
What is Hadoop Not Good For?
Hadoop is not an online data retrieval system. Each job takes seconds to days to execute and is not designed for a direct replacement for relational databases in real-time data processing. Hadoop is complementary to online systems that can be a source of data that the online systems simply can’t process. Some of the largest Hadoop clusters are paired with TeraData clusters for online systems and utilize an ETL process for moving data from Hadoop jobs into TeraData for online use.
Jobs that are highly dependent on steps are not appropriate for Hadoop clusters as massive parallelism requires independence for jobs in order to fully utilize a cluster and receive the benefits of Hadoop.
I get it! How Do I Program Hadoop?
There are a handful of programming languages built to execute Hadoop jobs. The basic job running on Hadoop is a Java program built in MapReduce.
Java MapReduce Jobs
This is the lowest level language for Hadoop jobs. This is essentially a Java class with a configuration, Map, and Reduce step. Each Map step can return only a single data type as a list of key-value pairs. Each Reduce step will take a Key-Value pair list for a single key and do some sort of reduction to a single value for a given key. Needless to say this process can get very complex when you talk about joining data sets or other more complex data operations.
Pig is another functional language used to build simple data analysis scripts to run on a Hadoop cluster. This language provides a higher-level implementation that allows for a much simpler code-base to execute relatively complex MapReduce jobs. This is usually the first language one learns and uses when implementing a Hadoop cluster.
Scalding is a more advanced language built on top of the Cascading. This is another high-level functional language that compiles down into Hadoop MapReduce jobs for execution. This is built to give more functional capabilities than Pig such as unit testing and application portability outside of Hadoop (i.e. this code can be executed using non-Hadoop infrastructure as well).
In short, Hadoop provides a robust method for massive data set processing on an offline basis. It can run massively parallel jobs that process petabytes of data, but it is not designed for fast lookups nor for fast queries. That is where a relational database comes in handy for storing features extracted from massive data sets. As an example, eBay’s search is powered by TeraData but TeraData is fed from their massive Hadoop cluster (over 10,000 nodes and 150 PB of data).
As an application developer, the main focus is submitting jobs using one of the high-level languages (Pig, Scalding, Hive, etc). As an example, AWS provides ample support for submitting Hadoop jobs to dynamic clusters that can be created and destroyed on-demand to provide cost-effective big data processing. However, each on-demand cluster can take up to 5 minutes to fully provision.