5

I'm new to Hadoop and I'm trying to figure out how it works. As an exercise I'm supposed to implement something similar to the WordCount example. The task is to read in several files, do the WordCount, and write an output file for each input file. Hadoop uses a combiner, shuffles the output of the map part as input for the reducer, and then writes one output file (I guess one for each running instance). I was wondering if it is possible to write one output file for each input file (i.e. keep the words of inputfile1 and write the result to outputfile1, and so on). Is it possible to override the Combiner class, or is there another solution for this (I'm not sure whether this should even be solved as a Hadoop task, but that is the exercise)?

Thanks...

2 Answers

1

The map.input.file configuration parameter holds the name of the file the mapper is currently processing. Get this value in the mapper and use it as the mapper's output key, so that all the key/value pairs from a single file go to one reducer.

Here is the code in the mapper. BTW, I am using the old MR API:

private JobConf conf;

@Override
public void configure(JobConf conf) {
    this.conf = conf;
}

@Override
public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    // assuming TextInputFormat: key is the byte offset, value is the line of text
    // map.input.file holds the path of the file this mapper is reading
    String filename = conf.get("map.input.file");
    output.collect(new Text(filename), value);
}

Then use MultipleOutputFormat, which allows the job to write multiple output files; the file names can be derived from the output keys and values.
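For illustration, a minimal sketch of such an output format with the old API (the class name PerInputFileOutputFormat is made up here; it assumes the mapper above emits the input file's path as the key):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each key group to an output file named after the original input file.
public class PerInputFileOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // key is the full path of the input file; use its last path component
        // as the output file name (e.g. .../inputfile1 -> inputfile1)
        return new Path(key.toString()).getName();
    }
}

You would then register it on the JobConf with conf.setOutputFormat(PerInputFileOutputFormat.class);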


1 Comment

Thanks, I think that is the best idea. Maybe I will use the old API because it seems easier to handle, but first I'll have a look at 0.20.
0

Hadoop 'chunks' data into blocks of a configured size; the default is 64 MB blocks. You may see where this causes issues for your approach: each mapper may get only a piece of a file. On the other hand, if a file is smaller than 64 MB (or whatever value is configured), each mapper will get only one file.

I've had a very similar constraint: I needed a set of files (output from a previous reducer in the chain) to be processed entirely by a single mapper, and I use the <64 MB fact in my solution. The main thrust of my solution is that I set it up to provide the mapper with the name of the file it needed to process, and internal to the mapper it loads/reads that file. This lets a single mapper process an entire file. It's not distributed processing of the file, but with the constraint of "I don't want individual files distributed", it works. :)

I had the process that launched my MR job write the names of the files to process into individual files, and those files went into the input directory. Since each of these files is <64 MB, a single mapper is generated for each one, and the map method is called exactly once (as there is only one entry in the file).

I then take the value passed to the mapper, open the file it names, and do whatever mapping I need to do. Since Hadoop tries to be smart about how it runs Map/Reduce processes, it may be necessary to specify the number of reducers so that each mapper's output goes to a single reducer. This can be set via the mapred.reduce.tasks configuration; I do it with job.setNumReduceTasks([NUMBER OF FILES HERE]);
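A rough sketch of that idea with the 0.20 API (WholeFileMapper and the emitted key/value are illustrative, not my exact code); each tiny input file is assumed to contain a single line holding the path of the real file to process:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each input split is a tiny file whose single line is the path of the file
// to process, so map() runs once and reads the real file itself.
public class WholeFileMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Path fileToProcess = new Path(value.toString().trim());
        FileSystem fs = fileToProcess.getFileSystem(context.getConfiguration());
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(fileToProcess)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // key everything by the file name so one reducer sees the whole file
                context.write(new Text(fileToProcess.getName()), new Text(line));
            }
        } finally {
            reader.close();
        }
    }
}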

My process had some additional requirements/constraints that may have made this specific solution appealing, but as an example of one file in to one file out: I've done it, and the basics are laid out above.

HTH

3 Comments

Thank you for the input. With setNumReduceTasks I get as many output files as I need. However, the input for the reducers still gets mixed/shuffled. I checked the output of my mapper, and it seems like one mapper is processing two files (but this shouldn't be the problem). But the results of the mappers that only process one file also get mixed with the results of the other mappers. Can I prevent Hadoop from doing this (shuffle/combine? maybe by setting the combiner class)? Did you just get all the file names and pass them to the mapper? Or am I missing something? Maybe another conf value to be set?
To force a specific reducer, have each mapper use a specific key when writing its output; the same keys go to the same reducer. You could pass a different value in the conf for each job and then use that value as the key. That would result in the output of each mapper going to a single reducer (in my experience).
Passing file names to a mapper so that a file is processed by a single mapper is not an efficient approach: there is no data locality and there will be more data shuffling. One way to solve this is to bundle the dependent files into one archive (gz, tar) and return false from the FileInputFormat#isSplitable method.
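A minimal sketch of that isSplitable override with the new API (the class name NonSplittableTextInputFormat is illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Tells Hadoop never to split these files, so each bundled file goes
// to exactly one mapper.
public class NonSplittableTextInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}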
