What is MapReduce in Hadoop? How Does it Work?


Ramya Shankar
Last updated on December 21, 2024

    The world is generating massive amounts of data. The major challenge is how to process this collected, largely unstructured data quickly and efficiently so that it can be transformed into actionable information. This is where the MapReduce programming model comes into play. MapReduce, one of the core components of the Apache Hadoop ecosystem, has earned a lot of popularity due to its ability to split large data sets and process the pieces in parallel.

    This article will help you learn what MapReduce is and how it works. Later, we will walk you through some significant benefits of MapReduce.

    So, without further ado, let us begin.

    What is Hadoop MapReduce?

    MapReduce is a programming model and a software framework in the Hadoop ecosystem. It enables us to write applications that process large volumes of data in parallel across multiple servers in a cluster. As Hadoop's processing unit, MapReduce is the heart of the framework.

    The name ‘MapReduce’ refers to two distinct tasks, map and reduce, with the map job always running before the reduce job. The framework first splits a large data set into smaller chunks and processes them in parallel on a cluster of commodity hardware. Each map task converts its chunk of data into key-value pairs (tuples), and these intermediate pairs then serve as the input to the ‘reduce’ job.

    Finally, the reduce job aggregates the key-value pairs from all the servers in the cluster into a consolidated output, typically a much smaller set of tuples. In addition to processing the data, MapReduce in the Hadoop ecosystem is also responsible for scheduling and monitoring tasks.
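
    To make the key-value idea concrete, here is a minimal, single-machine word-count sketch written in plain Java. It only illustrates the map, group, and reduce steps of the model; a real Hadoop job uses the MapReduce API shown later in this article, and the input lines here are purely hypothetical.

        import java.util.Arrays;
        import java.util.Map;
        import java.util.stream.Collectors;

        // Single-machine illustration of the MapReduce idea: map each record to
        // key-value pairs, group by key, then reduce each group to one value.
        public class WordCountIdea {
            public static void main(String[] args) {
                String[] lines = { "hadoop stores big data", "hadoop processes big data" };

                Map<String, Long> counts = Arrays.stream(lines)
                        // "map": break each line into words, one record per word
                        .flatMap(line -> Arrays.stream(line.split("\\s+")))
                        // "shuffle + reduce": group identical words and count them
                        .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

                System.out.println(counts); // e.g. {big=2, data=2, hadoop=2, processes=1, stores=1}
            }
        }

    Hadoop performs the same conceptual steps, but distributes the map and reduce work across many servers and reads and writes the data from HDFS rather than from in-memory arrays.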

    Let us say that we have a Hadoop cluster of 20,000 inexpensive commodity servers, with each server able to process a 256 MB block of data. With this Hadoop cluster, we can process around 5 TB of data in parallel, which is fast and efficient.

    In MapReduce, data storage and access are disk-based, i.e., the input is present in the form of a file or directory stored in HDFS, and the output is also stored as files in HDFS. Earlier, MapReduce was the sole approach for accessing and retrieving the data stored in the Hadoop Distributed File System (HDFS), which is Hadoop’s data storage system. Today, however, we also have query-based systems such as Hive and Pig, which retrieve data from HDFS using SQL-like statements (HiveQL) or high-level data-flow scripts (Pig Latin).

    How Does MapReduce Work in Hadoop?

    MapReduce processes data sets in up to four phases: mapping, shuffling, and reducing, plus an optional phase called the combiner. But before diving deeper into the working of MapReduce, let us first understand the different components of the MapReduce architecture. Knowing these components will help you better understand how MapReduce works.

    • Job: It is the complete piece of work that needs to be accomplished.
    • Task: A task is a small portion of a job that needs to be executed.
    • Job Tracker: It is responsible for scheduling jobs. Also, it keeps track of all the jobs assigned to the task trackers.
    • Task Tracker: It tracks all the tasks and reports their status to the job tracker.
    • Input Data: It is the data fed to the mapping phase of MapReduce.
    • Output Data: It is the final result obtained once the data has been processed.
    • Client: It is the program that, through the Application Programming Interface (API), submits jobs to MapReduce.
    • Hadoop MapReduce Master: It is responsible for splitting a job into smaller parts, called job-parts.
    • Job-parts: They are the small portions of the main job.

    Phases of MapReduce

    Following are the four phases of MapReduce:

    1. Mapping Phase

    The mapping phase involves two steps: splitting and mapping. To begin with, the splitting step breaks the input data set down into smaller units, called chunks (input splits). The Hadoop framework's RecordReader converts each chunk into key-value pairs. Next, the mapping step passes these key-value pairs to a mapper, which applies the user-defined processing logic to them and produces intermediate output in the same key-value format, as in the sketch below.
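
    As an illustration, here is a minimal mapper for the classic word-count example, written as a sketch against Hadoop's org.apache.hadoop.mapreduce API. The RecordReader supplies each line of text as a (byte offset, line) pair, and the mapper emits a (word, 1) pair for every word it finds.

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Word-count mapper: receives (byte offset, line text) pairs from the
        // RecordReader and emits one (word, 1) pair per word in the line.
        public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);   // intermediate key-value pair
                }
            }
        }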

    2. Shuffling Phase

    The shuffling phase also involves two steps: sorting and merging. In the first step, the intermediate key-value pairs are sorted by key; in the second step, all values belonging to the same key are merged together, so each key appears only once alongside the list of its values. For example, if the mappers emit ("hadoop", 1), ("spark", 1), and ("hadoop", 1), the shuffle delivers ("hadoop", [1, 1]) and ("spark", [1]) to the reducer. The output of this phase is still in the key-value pair format.

    3. Reducer Phase

    The reducer phase takes the grouped output of the shuffling phase as its input and reduces each group of values to a smaller result. This phase produces the final output, which summarizes the entire data set, and stores it in HDFS.
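
    Continuing the hypothetical word-count example, the sketch below shows the matching reducer. For each word, the shuffle delivers all of its counts, and the reducer sums them into a single (word, total) pair.

        import java.io.IOException;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        // Word-count reducer: for each key, the shuffle supplies all of its
        // values, and reduce() sums them into one (word, total) output pair.
        public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                result.set(sum);
                context.write(key, result);  // e.g. ("hadoop", 2)
            }
        }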

    4. Combiner Phase

    The primary role of the combiner phase is to optimize the MapReduce process by reducing the network traffic between the mapping and reducer phases. It acts as a local ‘mini-reducer’ that runs on each mapper's output, so far fewer intermediate key-value pairs have to travel across the network to the reducers, as shown in the sketch below.
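
    In the word-count sketch, the reduce logic (summing counts) is associative and commutative, so the same reducer class can be reused as the combiner. The job object below belongs to the hypothetical driver shown in the next section.

        // Reuse the word-count reducer as a combiner: each mapper pre-aggregates
        // its own (word, 1) pairs locally, cutting intermediate network traffic.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);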

    Working

    Initially, a client submits a job to the Hadoop MapReduce Master, where it gets divided into small subparts, called job-parts. A developer writes the code or logic that processes the data according to their organization's requirements.

    Next, the input data is fed to MapReduce, where it gets split and mapped into intermediate data; this is the mapping phase. The intermediate data then undergoes the shuffling phase, and the resulting data is fed to the reducer, which generates the desired output and stores it in HDFS.
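
    Putting the phases together, here is a sketch of a driver class for the hypothetical word-count job. It wires the mapper, combiner, and reducer classes together and points the job at input and output paths in HDFS, which are passed on the command line.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        // Driver that submits the word-count job: it configures the mapper,
        // combiner, and reducer, and sets the HDFS input/output locations.
        public class WordCountDriver {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = Job.getInstance(conf, "word count");

                job.setJarByClass(WordCountDriver.class);
                job.setMapperClass(TokenizerMapper.class);
                job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
                job.setReducerClass(IntSumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
                FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory

                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

    Such a job would typically be packaged as a JAR and submitted with a command like hadoop jar wordcount.jar WordCountDriver /input /output, where /input and /output are hypothetical HDFS paths.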

    How Does MapReduce Organize Work?

    MapReduce uses a job tracker and task trackers to organize the work. Each job submitted for execution is managed by a single job tracker, which runs on the NameNode, while the task trackers run on the DataNodes. When the Hadoop MapReduce Master splits a job into multiple tasks, the job tracker schedules them and runs them on different data nodes in the cluster. Each data node has one task tracker that monitors its tasks and sends progress reports to the job tracker. In this way, the job tracker manages the progress of every task, and if any task fails, the job tracker reschedules it for execution on a different task tracker.

    Benefits of MapReduce

    Here are some significant benefits of the MapReduce model:

    1. Speed

    The processing speed of MapReduce is astonishing. It has the capability to process large volumes of unstructured data in a short time, i.e., terabytes of data in minutes and petabytes in hours.

    2. Scalability

    As Hadoop distributes large data sets across multiple servers in a cluster, it is highly scalable. Adding more servers to a cluster increases its processing power. Traditional database management systems cannot scale to process terabytes or petabytes of data in this way. MapReduce allows businesses and organizations to run applications that work on terabytes of data spread across many nodes.

    3. Flexibility

    Hadoop MapReduce enables organizations and businesses to access and work on new sources of data, whether structured or unstructured. Therefore, they can leverage all the data collected from different sources, such as social media, email, and clickstreams.

    4. Security

    Security is at the core of any business or organization, and unauthorized access to terabytes or petabytes of an organization's data can be devastating. Hadoop MapReduce ensures the security of data stored in HDFS and HBase, allowing only authorized users to access and operate on it.

    5. Parallel Processing

    Parallel processing is one of the most significant and notable benefits of MapReduce, which is responsible for ensuring high processing speed. It splits a job into multiple units and runs them on different servers parallelly.

    6. Data Availability

    When we send data to a node in HDFS, it gets replicated and stored on other nodes of the same network. Therefore, even if the data on one node is lost, you can get its copy from other nodes in a network.

    7. Fault-Tolerance

    Hadoop MapReduce can identify faults that occur in the framework and quickly recover from them. Therefore, it handles failures with ease.

    Conclusion

    MapReduce is the core of the Hadoop framework, and it performs data processing quickly and efficiently. It is a highly scalable, secure, and fault-tolerant framework suitable for processing huge amounts of data generated by entertainment, social media, and e-commerce websites.

    E-commerce companies like Walmart, Amazon, and eBay use this programming model to analyze customers’ buying behavior. Social media platforms, like LinkedIn, Facebook, and Twitter, use MapReduce to evaluate user information. Also, one of the most popular entertainment platforms, Netflix, uses MapReduce to analyze customers’ clicks and logs for suggesting movies or web series based on their interests.


    FAQs

    1. What are the two primary phases of MapReduce?

    The two primary phases of MapReduce are the mapping phase and the reduction phase.

    2. Why was MapReduce introduced?

    Google introduced MapReduce with the aim of simplifying data processing by processing large volumes of data in parallel on different computer systems.

    3. Can MapReduce cache intermediate data in memory?

    No. MapReduce in Hadoop is not able to cache the intermediate data in memory for future requirements.
