5

I'm new to Hadoop and I'm trying to figure out how it works. As an exercise I'm supposed to implement something similar to the WordCount example. The task is to read in several files, do the WordCount, and write an output file for each input file. Hadoop uses a combiner, shuffles the output of the map part as input for the reducer, and then writes one output file (I guess one for each running instance). I was wondering if it is possible to write one output file for each input file (i.e. keep the words of inputfile1 and write the result to outputfile1, and so on). Is it possible to override the Combiner class, or is there another solution for this (I'm not sure whether this should even be solved as a Hadoop task, but that is the exercise)?

Thanks...

2 Answers

1

The map.input.file configuration parameter holds the name of the file the mapper is currently processing. Get this value in the mapper and use it as the mapper's output key, so that all the key/value pairs from a single file go to one reducer.

Here is the code in the mapper. BTW, I am using the old MR API:

private JobConf conf;

@Override
public void configure(JobConf conf) {
    this.conf = conf;
}

@Override
public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    // assuming TextInputFormat: key is the byte offset, value is the line of text
    // map.input.file holds the path of the file this mapper is reading
    String filename = conf.get("map.input.file");
    output.collect(new Text(filename), value);
}

Then use MultipleOutputFormat, which allows the job to write multiple output files; the file names can be derived from the output keys and values.
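For illustration, a minimal sketch of such an output format with the old API (the class name PerInputFileOutputFormat is made up here; it assumes the mapper above emits the input file's path as the key):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each key group to an output file named after the original input file.
public class PerInputFileOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // key is the full path of the input file; use its last path component
        // as the output file name (e.g. .../inputfile1 -> inputfile1)
        return new Path(key.toString()).getName();
    }
}

You would then register it on the JobConf with conf.setOutputFormat(PerInputFileOutputFormat.class);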


1 Comment

Thanks, I think that is the best idea. Maybe I will use the old API because it seems easier to handle, but first I'll have a look at 0.20.
0

Hadoop 'chunks' data into blocks of a configured size; the default is 64 MB blocks. You may see where this causes issues for your approach: each mapper may get only a piece of a file. On the other hand, if a file is smaller than 64 MB (or whatever value is configured), each mapper will get only one file.

I've had a very similar constraint: I needed a set of files (output from a previous reducer in the chain) to be processed entirely by a single mapper, and I use the <64 MB fact in my solution. The main thrust of my solution is that I set it up to provide the mapper with the name of the file it needed to process, and internal to the mapper it loads/reads that file. This lets a single mapper process an entire file. It's not distributed processing of the file, but with the constraint of "I don't want individual files distributed", it works. :)

I had the process that launched my MR job write the names of the files to process into individual files, and those files went into the input directory. Since each of these files is <64 MB, a single mapper is generated for each one, and the map method is called exactly once (as there is only one entry in the file).

I then take the value passed to the mapper, open the file it names, and do whatever mapping I need to do. Since Hadoop tries to be smart about how it runs Map/Reduce processes, it may be necessary to specify the number of reducers so that each mapper's output goes to a single reducer. This can be set via the mapred.reduce.tasks configuration; I do it with job.setNumReduceTasks([NUMBER OF FILES HERE]);
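A rough sketch of that idea with the 0.20 API (WholeFileMapper and the emitted key/value are illustrative, not my exact code); each tiny input file is assumed to contain a single line holding the path of the real file to process:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each input split is a tiny file whose single line is the path of the file
// to process, so map() runs once and reads the real file itself.
public class WholeFileMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Path fileToProcess = new Path(value.toString().trim());
        FileSystem fs = fileToProcess.getFileSystem(context.getConfiguration());
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(fileToProcess)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // key everything by the file name so one reducer sees the whole file
                context.write(new Text(fileToProcess.getName()), new Text(line));
            }
        } finally {
            reader.close();
        }
    }
}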

My process had some additional requirements/constraints that may have made this specific solution appealing, but as an example of one file in to one file out: I've done it, and the basics are laid out above.

HTH

3 Comments

Thank you for the input. With setNumReduceTasks I get as many output files as I need. However, the input for the reducers still gets mixed/shuffled. I checked the output of my mapper, and it seems like one mapper is processing two files (but this shouldn't be the problem). But the results of the mappers that only process one file also get mixed with the results of the other mappers. Can I prevent Hadoop from doing this (shuffle/combine? maybe by setting the combiner class)? Did you just get all the file names and pass them to the mapper? Or am I missing something? Maybe another conf value to be set?
To force a specific reducer, have each mapper use a specific key when writing its output; the same keys go to the same reducer. You could pass a different value in the conf for each job and then use that value as the key. That would result in the output of each mapper going to a single reducer (in my experience).
Passing file names to a mapper so that a file is processed by a single mapper is not an efficient approach: there is no data locality and there will be more data shuffling. One way to solve this is to bundle the dependent files into one archive (gz, tar) and return false from the FileInputFormat#isSplitable method.
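A minimal sketch of that isSplitable override with the new API (the class name NonSplittableTextInputFormat is illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Tells Hadoop never to split these files, so each bundled file goes
// to exactly one mapper.
public class NonSplittableTextInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}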
