
So I need two files as input to my MapReduce program: City.dat and Country.dat

In my main method I'm parsing the command-line arguments like this:

Path cityInputPath = new Path(args[0]);
Path countryInputPath = new Path(args[1]);
Path outputPath = new Path(args[2]);
MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
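For context, a complete driver reproducing this setup might look like the following sketch (the Job setup surrounding my six lines above, including the job name and the omitted reducer wiring, is illustrative, not my exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Capital {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "capital join");
        job.setJarByClass(Capital.class);

        Path cityInputPath = new Path(args[0]);
        Path countryInputPath = new Path(args[1]);
        Path outputPath = new Path(args[2]);

        // One mapper class per input file.
        MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
        MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
        FileOutputFormat.setOutputPath(job, outputPath);

        // Reducer, output key/value classes etc. omitted here.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}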

When I now run my program with the following command:

hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output

I get the following error:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /user/cloudera/capital/input/Country.dat already exists

Why does it treat this as my output directory? I specified another directory as the output directory. Can somebody explain this?


3 Answers


Based on the stack trace, the path your job is using as its output directory already exists. So the simplest thing is actually to delete it before running the job:

bin/hadoop fs -rmr /user/cloudera/capital/output
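Alternatively, you can delete the output path from within the driver before submitting the job; a sketch using the FileSystem API (this assumes it runs after job and outputPath have been set up as in your code):

import org.apache.hadoop.fs.FileSystem;

FileSystem fs = FileSystem.get(job.getConfiguration());
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true); // true = delete recursively
}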

Besides that, your arguments start with the class name of your main class, org.myorg.Capital, so that is the argument at index zero (based on the stack trace and the code you have provided).
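You can verify this by dumping the arguments at the top of your main method (a throwaway debug snippet, not part of your driver):

for (int i = 0; i < args.length; i++) {
    System.out.println("args[" + i + "] = " + args[i]); // expect args[0] to be org.myorg.Capital
}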

Basically you need to shift all your indices one to the right:

Path cityInputPath = new Path(args[1]);
Path countryInputPath = new Path(args[2]);
Path outputPath = new Path(args[3]);
MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);

Don't forget to clear your output folder though!

Also, a small tip for you: you can separate the input files with a comma, so you can set them with a single call like this:

hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat,/user/cloudera/capital/input/Country.dat

And in your Java code:

FileInputFormat.addInputPaths(job, args[1]);

Note that this only works when a single mapper class handles both files; with MultipleInputs and per-file mappers, as in your driver, you still need the two separate addInputPath calls.

3 Comments

That is strange, because I have always started my programs with this command and it never treated org.myorg.Class as the zeroth argument. Shifting all my indices strangely leads to the same error. Also, my output folder does not exist. The problem is that it thinks /user/cloudera/input/Country.dat is my output folder; that's why it's not empty. The question is why it thinks that this is my output folder.
If it leads to the exact same error, you are not running the code you have provided.
In my experience, @gaussd is right: org.myorg.Capital is not the 0th element in args. It just says "start with the class org.myorg.Capital in the capital.jar file".

What is happening here is that the class name is deemed to be the first argument!

By default, the first non-option argument is the name of the class to be invoked. A fully-qualified class name should be used. If the -jar option is specified, the first non-option argument is the name of a JAR archive containing class and resource files for the application, with the startup class indicated by the Main-Class manifest header.

So what I would suggest is that you add a manifest file to your jar in which you specify the main class. Your MANIFEST.MF file may look like this:

Manifest-Version: 1.0
Main-Class: org.myorg.Capital

And now your command would look like:

hadoop jar capital.jar /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output
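If you build the jar by hand, you can attach the manifest at packaging time (a sketch; the classes directory is an assumption about your build layout):

jar cfm capital.jar MANIFEST.MF -C classes .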

You can certainly just change the index values used in your code, but that's not an advisable solution.



Can you try this:

hadoop jar capital.jar /user/cloudera/capital/input /user/cloudera/capital/output

This should read all files in the single input directory.
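The matching driver code might look like this sketch (assuming a single mapper class handles both City.dat and Country.dat; with your per-file JoinCityMapper/JoinCountryMapper setup you would still need MultipleInputs):

FileInputFormat.addInputPath(job, new Path(args[0]));    // the whole input directory
FileOutputFormat.setOutputPath(job, new Path(args[1]));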

