Sorting in hadoop framework

Question

I have tried implementing secondary sort. so i have a question related to that :

Sorting happens 3 times in Hadoop framework 

 1) Sorting in Buffer ( Sorting occur based on key of a map function)
 2) Sorting during merging of spill files of mapper output( ?????????????)
 3) Sorting at Reducer side when reducer gets map output from various mapper based on partition logic again merging happens .( Sorting occur based on Sort Comparator )

if my above understanding is correct, Then based on what logic sorting occurs during spill files merging on map output files ,it it based on keys that we use in map function or sort comparator on which reduce side sorting happen and why ?

In a nutshell, sorting the spills occurs, so that each mapper will output a single merged list of records (the buffer could be full, before a map task finishes). Map-side sorting takes place in order to "lighten" the sort-workload of the reducer. All of these sorting phases use the same sort comparator. — vefthym
– vefthym, Commented Sep 18, 2014 at 7:42

Srinivasarao Daruna · Accepted Answer · 2014-11-28 10:23:42Z

2

To answer precisely, in the buffer, the values are ordered based on the keys, where as at the reducer they will be compared using comparator.

This is how the sort at map end happens. Each map task has a circular memory buffer that it writes the output to. When the contents of the buffer reaches a certain threshold size ,a background thread will start to spill the contents to disk.

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.

The final order at the reducer will be done by comparing each key to other one, which is nothing but a comparator.

To examine this, I have written a ReverseIntWritable, which will order in reverse to IntWritable and i have written the output same way from mapper and reducer.

If i have not used reducer, the input {(1, xyz), (2,ijk)} come out as {(1, xyz), (2,ijk)}. If i have used reducer, the output for the same input came out as {(2,ijk) , (1, xyz) }.

Hope this helps..

answered Nov 28, 2014 at 10:23

Srinivasarao Daruna

3,3748 gold badges31 silver badges71 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

user3484461 Over a year ago

Thanks for your respose ,but my confusion was in second phase of sorting , i am preety much ok with first and third phase ,you also explained first and third , put some light on second phase ?

Srinivasarao Daruna Over a year ago

After map phase, the spills are merged into one sorted and paritioned file. The configuration property io.sort.factor controls the maximum number of spills to merge at once. partitioning indicates the to which partition it belongs to. You can search about partitioned file to just visualize it.

Srinivasarao Daruna Over a year ago

And these partitions in output file will be then made available to reducer over http. BTW, this is process at each mapper. This process happens at each and every mapper. last but not the least, when merging the spill files, the data will be sorted and partitioned by the key and if combiner function is there, combiner runs on the merged files.

Srinivasarao Daruna Over a year ago

Hi, Please let me know if you have any more questions. I thought you just want to know about the way of sorting as per your question. So gave the details at first. If you want me to put some more details about the above comments, i can do.

Collectives™ on Stack Overflow

Sorting in hadoop framework

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related