What is MapReduce?
MapReduce is the processing layer of Hadoop. The MapReduce programming model is designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.
The work (the complete job) submitted by the user to the master is divided into small pieces of work (tasks) and assigned to slaves.
In MapReduce, the input is a list, and the output produced is again a list. MapReduce is the heart of Hadoop; much of Hadoop's power and efficiency comes from the parallel processing that
MapReduce performs.
Basic Terminologies
MapReduce Job
A MapReduce job, or a “full program”, is an execution of a Mapper and Reducer across a complete data set.
Task in Map Reduce
A task in MapReduce is the execution of a Mapper or a Reducer on a slice of data. It is also called a Task-In-Progress (TIP), meaning that the processing of data is in progress
on either a mapper or a reducer.
Map Execution Phases
Let me explain the execution in detail, phase by phase, below:
Map Abstraction
Map takes a key/value pair as input. Whether the data is in a structured or unstructured format, the framework converts the incoming data into keys and values.
Key: a reference to the input value.
Value: the data set on which to operate.
Map processing:
A function defined by the user – the user can write custom business logic to process the data according to their needs.
It is applied to every value in the input.
Map produces a new list of key/value pairs:
The output of Map is called the intermediate output.
The output of Map is stored on the local disk, from where it is shuffled to the reduce nodes.
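To make the map abstraction concrete, here is a minimal sketch of a word-count Mapper using the org.apache.hadoop.mapreduce API. Word count, the class and variable names, and the whitespace tokenization are illustrative assumptions, not something prescribed by the description above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical example: counts words in text input.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line of text
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit an intermediate (word, 1) pair
            }
        }
    }
}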
Reduce Abstraction
Reduce takes intermediate key/value pairs as input and processes the output of the mapper. Usually, in the reducer, we do aggregation or summation-style computation.
The input given to the reducer is generated by Map (the intermediate output).
The key/value pairs provided to Reduce are sorted by key.
Reduce processing:
A function defined by the user – here too, the user can write custom business logic to produce the final output.
An iterator supplies the values for a given key to the Reduce function.
Reduce produces a final list of key/value pairs:
The output of Reduce is called the final output.
It can be of a different type from the input pair.
The output of Reduce is stored in HDFS.
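Continuing the same hypothetical word-count example, a minimal Reducer sketch could look like this; it simply sums the values that the iterator supplies for each key.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {          // the iterator supplies all values for this key
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));   // final (word, total) pair, written to HDFS
    }
}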
How Do Map and Reduce Work Together?
Concept: By default, 2 mappers run at a time on a slave, and this number can be increased as per the requirements. It depends on factors such as
datanode hardware, block size, machine configuration, etc. The number of mappers should not be increased beyond a certain limit, because doing so will decrease performance.
Since Hadoop works on huge volumes of data, it is not practical to move such volumes over the network. Hence Hadoop follows the principle of moving the
algorithm to the data rather than the data to the algorithm. This is called data locality.
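As a rough illustration, in the classic MRv1 deployment the per-node limit on concurrent map tasks is a cluster-side setting in mapred-site.xml (in YARN-based clusters the limit is instead governed by container memory); the value 4 below is only an example.

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
  <description>Maximum number of map tasks a TaskTracker runs simultaneously (the default is 2).</description>
</property>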
The various phases of MapReduce job execution are Input Files, InputFormat, InputSplits, RecordReader, Mapper, Combiner, Partitioner, Shuffling and Sorting, Reducer,
RecordWriter, and OutputFormat. Each is described below.
MapReduce Data Flow
1. Input Files
The data for a MapReduce task is stored in input files, and input files typically live in HDFS. The format of these files is arbitrary; line-based log files and binary formats
can also be used.
2. InputFormat
The InputFormat defines how these input files are split and read. It selects the files or other objects that are used for input, and it creates the InputSplits.
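For illustration, the InputFormat for a job is chosen in the driver. TextInputFormat is the default, but a sketch of setting it explicitly (assuming a Job object named job, as in the driver sketch at the end of this data flow, and a purely hypothetical input path) would be:

job.setInputFormatClass(TextInputFormat.class);                        // how the input files are split and read
FileInputFormat.addInputPath(job, new Path("/user/example/input"));    // hypothetical HDFS input directory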
3. InputSplits
InputSplits are created by the InputFormat and logically represent the data that will be processed by an individual Mapper (we will look at the Mapper below). One map task is created for each split; thus, the number of map tasks equals the number of InputSplits. Each split is divided into records, and each record is processed by the mapper.
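As a tuning sketch (with purely illustrative numbers), the split size can be bounded through FileInputFormat; capping the maximum split size at 64 MB would make a 256 MB splittable input produce four InputSplits and therefore four map tasks, assuming the same job object as above:

FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);   // cap each InputSplit at 64 MB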
4. RecordReader
The RecordReader communicates with the InputSplit and converts the data into key-value pairs suitable for reading by the mapper. By default, TextInputFormat is used to
convert the data into key-value pairs. The RecordReader keeps communicating with the InputSplit until the reading of the file is completed, and it assigns a byte offset (a unique number) to each line
present in the file as the key. These key-value pairs are then sent to the mapper for further processing.
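For example, given a small hypothetical text file, the default line-oriented RecordReader would hand the mapper one pair per line, keyed by the byte offset at which that line starts:

Input file (hypothetical, two lines):
    apple banana
    cherry

Key-value pairs sent to the mapper:
    (0,  "apple banana")
    (13, "cherry")        (13 = the 12 characters of the first line plus its newline)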
5. Mapper
The Mapper processes each input record (from the RecordReader) and generates a new key-value pair, which can be of a completely different type from the input pair. The output of the Mapper, also known as the intermediate output, is written to the local disk. It is not stored on HDFS because this is temporary data, and writing it to HDFS would create unnecessary copies (HDFS is also a high-latency system). The Mapper's output is passed to the combiner for further processing.
6. Combiner
The combiner is also known as a ‘mini-reducer’. The Hadoop MapReduce combiner performs local aggregation on each mapper’s output, which helps to minimize the data transfer between the mapper and the reducer (we will see the reducer below). Once the combiner has run, its output is passed to the partitioner for further work.
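In the hypothetical word-count job sketched earlier, summation is associative and commutative, so the Reducer class can double as the combiner (an assumption that holds for this example, not for every job). A one-line driver sketch:

job.setCombinerClass(WordCountReducer.class);   // local aggregation of (word, 1) pairs on the map side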
7. Partitioner
In Hadoop MapReduce, the Partitioner comes into the picture when we are working with more than one reducer (with a single reducer, the partitioner is not used).
The Partitioner takes the output from the combiners and partitions it. Partitioning is done on the basis of the key, and the output is then sorted. A hash function on the key (or a subset of
the key) is used to derive the partition.
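A minimal sketch of such hash-based partitioning is shown below; it mirrors what Hadoop's default HashPartitioner does, so all pairs with the same key land in the same partition and hence reach the same reducer.

import org.apache.hadoop.mapreduce.Partitioner;

public class SketchPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // mask the sign bit so the result is non-negative, then spread keys across the reducers
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}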
8. Shuffling and Sorting
Now the output is shuffled to the reduce node (a normal slave node, called a reducer node because the reduce phase runs there). Shuffling is the physical movement of the data, and it is done over the network. Once all the mappers have finished and their output has been shuffled onto the reducer nodes, this intermediate output is merged and sorted,
and then provided as input to the reduce phase.
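For instance, a driver line such as the one below (with an illustrative value) creates two partitions; during shuffling, each mapper's partition 0 is copied over the network to the node running reducer 0 and partition 1 to reducer 1, where the copied pieces are merge-sorted by key.

job.setNumReduceTasks(2);   // two reduce tasks, hence two partitions to shuffle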
9. Reducer
The Reducer takes the set of intermediate key-value pairs produced by the mappers as its input and runs the reducer function on each key and its group of values to generate the output. The output of the
reducer is the final output, which is stored in HDFS.
10. RecordWriter
The RecordWriter writes the output key-value pairs from the Reducer phase to the output files.
11. OutputFormat
The way these output key-value pairs are written to output files by the RecordWriter is determined by the OutputFormat. The OutputFormat instances provided by Hadoop are used to write files to HDFS or to the local disk. Thus, the final output of the reducer is written to HDFS by an OutputFormat instance.
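Putting the phases together, here is a hedged sketch of a complete driver for the hypothetical word-count job used throughout this section; the input and output paths are taken from the command line, and the output directory must not already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);       // optional local aggregation
        job.setReducerClass(WordCountReducer.class);

        job.setInputFormatClass(TextInputFormat.class);     // how input is split and read
        job.setOutputFormatClass(TextOutputFormat.class);   // how the RecordWriter writes output

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));      // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // output directory (must not exist yet)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It could be launched with something like: hadoop jar wordcount.jar WordCountDriver /user/example/input /user/example/output (hypothetical jar name and paths).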