
I am building the JAR for my MapReduce job and running it with the hadoop-local command. Instead of specifying the path of every file in my input folder, is there a way to pass the whole folder so that all the files in it are used as input? The number of files and their contents can change because of the nature of the job, so I do not know in advance how many files there will be. I would like to pass every file in the input folder to my MapReduce program, iterate over each file to compute a certain function, and send the results to the Reducer. I am using a single Map/Reduce program, written in Java. I am also able to use the hadoop-moonshot command, but I am working with hadoop-local at the moment.

Thanks.

  • If you specify an HDFS directory in your job instead of a file, then all the files should be read. Can you please edit your question to include the command you are running? Maybe some code in a minimal reproducible example, also? Commented May 14, 2016 at 17:46
  • Thanks @cricket_007. Could you perhaps provide an example of a call that uses an HDFS directory instead of a single file, please? Also, how would I get a separate output file for each input? I'm guessing it's by using the MultipleOutputs class somehow, but I can't see how at the moment. Commented May 14, 2016 at 17:59
  • I can't remember how to output multiple files, but the MapReduce output itself must go to one directory. As for directory input, the WordCount example reads two files from one directory. Commented May 14, 2016 at 18:04
  • @Shah.1 have you tried setting: FileInputFormat.setInputDirRecursive(mapReduceJob, true); to be able to read the files recursively? Commented May 17, 2016 at 12:23

1 Answer


You don't have to pass each file individually as input to a MapReduce job.

The FileInputFormat class already provides an API that accepts a list of multiple paths as input to a MapReduce program.

public static void setInputPaths(Job job,
                 Path... inputPaths)
                          throws IOException

Set the array of Paths as the list of inputs for the map-reduce job. Parameters:

job - The job to modify

inputPaths - the Paths of the input directories/files for the map-reduce job.

Example code from the Apache tutorial (the path in args[0] may be a directory, in which case every file directly inside it is read):

Job job = Job.getInstance(conf, "word count");
FileInputFormat.addInputPath(job, new Path(args[0]));
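As a minimal sketch of a driver that takes a whole input directory, something like the following might work. The class name DirectoryInputDriver and the two-argument command line are assumptions made for illustration, and the code needs the Hadoop client libraries on the classpath. No mapper or reducer is set, so Hadoop's default identity Mapper and Reducer run; a real job would set its own classes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: args[0] is an input directory, args[1] an output directory.
public class DirectoryInputDriver {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: DirectoryInputDriver <input dir> <output dir>");
            System.exit(2);
        }

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "directory input");
        job.setJarByClass(DirectoryInputDriver.class);

        // With the default TextInputFormat and the identity Mapper/Reducer,
        // keys are byte offsets and values are lines of text.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Passing a directory makes every file directly inside it an input.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Optional: also descend into subdirectories of the input directory.
        FileInputFormat.setInputDirRecursive(job, true);

        // The output directory must not exist before the job runs.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

To read from several directories or files at once, the addInputPath call could be replaced with FileInputFormat.setInputPaths(job, new Path(dirA), new Path(dirB)), where dirA and dirB are placeholder names.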

MultipleInputs provides the following API, for cases where different inputs need a different InputFormat or Mapper.

public static void addInputPath(Job job,
                Path path,
                Class<? extends InputFormat> inputFormatClass,
                Class<? extends Mapper> mapperClass)

Add a Path with a custom InputFormat and Mapper to the list of inputs for the map-reduce job.

Related SE question:

Can hadoop take input from multiple directories and files

Refer to the MultipleOutputs API for your second query, on multiple output paths.

FileOutputFormat.setOutputPath(job, outDir);

// Defines an additional text-based output named 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
    LongWritable.class, Text.class);

// Defines an additional sequence-file-based output named 'seq' for the job
MultipleOutputs.addNamedOutput(job, "seq",
    SequenceFileOutputFormat.class,
    LongWritable.class, Text.class);
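For one output per input file, one possible approach is to derive a base output path from the input file name and pass it as the baseOutputPath argument of MultipleOutputs.write(key, value, baseOutputPath). The sketch below covers only the naming step, in plain Java. The class PerInputOutputNames and its method are hypothetical helpers, not part of Hadoop, and the "stem/part" layout is an assumption. In a mapper, the input file name can usually be obtained with ((FileSplit) context.getInputSplit()).getPath().getName().

```java
// Hypothetical helper, not part of Hadoop: maps an input file name to a
// base output path suitable for MultipleOutputs.write(key, value, basePath).
final class PerInputOutputNames {

    private PerInputOutputNames() {}

    /**
     * Turns a name such as "sales.csv" into "sales/part". Characters other
     * than letters, digits, '.', '_' and '-' are replaced with '_'.
     */
    static String baseOutputPathFor(String inputFileName) {
        if (inputFileName == null || inputFileName.isEmpty()) {
            throw new IllegalArgumentException("inputFileName must be non-empty");
        }
        int dot = inputFileName.lastIndexOf('.');
        // Drop the last extension, but leave names like ".hidden" intact.
        String stem = dot > 0 ? inputFileName.substring(0, dot) : inputFileName;
        String safe = stem.replaceAll("[^A-Za-z0-9._-]", "_");
        return safe + "/part";
    }

    public static void main(String[] args) {
        for (String name : new String[] {"sales.csv", "log 2016.txt", "README"}) {
            System.out.println(name + " -> " + baseOutputPathFor(name));
        }
    }
}
```

Hadoop appends its own suffix to the base path, so records written with the base path "sales/part" from a reducer would be expected to land in a file such as sales/part-r-00000 under the job's output directory.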

Have a look at related SE questions regarding multiple output files.

Writing to multiple folders in hadoop?

hadoop method to send output to multiple directories


6 Comments

That example code you pulled only uses one input path
Except for the title, which mentions both input and output, the body of the question only asks about multiple files as input. There is no mention of output. I have since added the setOutputPath API.
The title and the question don't really match, though. All that was asked was reading a directory of files, which, yes, this code can do. I was simply saying that you mention multiple paths, but the example code doesn't use that method.
Thanks @Ravindrababu, but how would I use that to pass in multiple input files? And how would I specify multiple output files, each written to a different directory? Say, a different text file holding the results for each input.
Updated the answer