4

I have tried implementing secondary sort. so i have a question related to that :

Sorting happens 3 times in Hadoop framework 

 1) Sorting in Buffer ( Sorting occur based on key of a map function)
 2) Sorting during merging of spill files of mapper output( ?????????????)
 3) Sorting at Reducer side when reducer gets map output from various mapper based on partition logic again merging happens .( Sorting occur based on Sort Comparator )

if my above understanding is correct, Then based on what logic sorting occurs during spill files merging on map output files ,it it based on keys that we use in map function or sort comparator on which reduce side sorting happen and why ?

2
  • In a nutshell, sorting the spills occurs, so that each mapper will output a single merged list of records (the buffer could be full, before a map task finishes). Map-side sorting takes place in order to "lighten" the sort-workload of the reducer. All of these sorting phases use the same sort comparator. Commented Sep 18, 2014 at 7:42
  • key based sorting would be used only first time in buffer ? Commented Sep 18, 2014 at 11:58

1 Answer 1

2

To answer precisely, in the buffer, the values are ordered based on the keys, where as at the reducer they will be compared using comparator.

This is how the sort at map end happens. Each map task has a circular memory buffer that it writes the output to. When the contents of the buffer reaches a certain threshold size ,a background thread will start to spill the contents to disk.

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.

The final order at the reducer will be done by comparing each key to other one, which is nothing but a comparator.

To examine this, I have written a ReverseIntWritable, which will order in reverse to IntWritable and i have written the output same way from mapper and reducer.

If i have not used reducer, the input {(1, xyz), (2,ijk)} come out as {(1, xyz), (2,ijk)}. If i have used reducer, the output for the same input came out as {(2,ijk) , (1, xyz) }.

Hope this helps..

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks for your respose ,but my confusion was in second phase of sorting , i am preety much ok with first and third phase ,you also explained first and third , put some light on second phase ?
After map phase, the spills are merged into one sorted and paritioned file. The configuration property io.sort.factor controls the maximum number of spills to merge at once. partitioning indicates the to which partition it belongs to. You can search about partitioned file to just visualize it.
And these partitions in output file will be then made available to reducer over http. BTW, this is process at each mapper. This process happens at each and every mapper. last but not the least, when merging the spill files, the data will be sorted and partitioned by the key and if combiner function is there, combiner runs on the merged files.
Hi, Please let me know if you have any more questions. I thought you just want to know about the way of sorting as per your question. So gave the details at first. If you want me to put some more details about the above comments, i can do.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.