
So I need two files as input to my MapReduce program: City.dat and Country.dat

In my main method I'm parsing the command-line arguments like this:

Path cityInputPath = new Path(args[0]);
Path countryInputPath = new Path(args[1]);
Path outputPath = new Path(args[2]);
MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
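For context, a complete driver reproducing this setup might look like the following sketch (the Job setup surrounding my six lines above, including the job name and the omitted reducer wiring, is illustrative, not my exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Capital {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "capital join");
        job.setJarByClass(Capital.class);

        Path cityInputPath = new Path(args[0]);
        Path countryInputPath = new Path(args[1]);
        Path outputPath = new Path(args[2]);

        // One mapper class per input file.
        MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
        MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
        FileOutputFormat.setOutputPath(job, outputPath);

        // Reducer, output key/value classes etc. omitted here.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}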

When I now run my program with the following command:

hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output

I get the following error:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /user/cloudera/capital/input/Country.dat already exists

Why does it treat this as my output directory? I specified another directory as the output directory. Can somebody explain this?


3 Answers


Based on the stack trace, the path your job is using as its output directory already exists. So the simplest thing is actually to delete it before running the job:

bin/hadoop fs -rmr /user/cloudera/capital/output
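Alternatively, you can delete the output path from within the driver before submitting the job; a sketch using the FileSystem API (this assumes it runs after job and outputPath have been set up as in your code):

import org.apache.hadoop.fs.FileSystem;

FileSystem fs = FileSystem.get(job.getConfiguration());
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true); // true = delete recursively
}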

Besides that, your arguments start with the class name of your main class, org.myorg.Capital, so that is the argument at index zero (based on the stack trace and the code you have provided).
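You can verify this by dumping the arguments at the top of your main method (a throwaway debug snippet, not part of your driver):

for (int i = 0; i < args.length; i++) {
    System.out.println("args[" + i + "] = " + args[i]); // expect args[0] to be org.myorg.Capital
}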

Basically you need to shift all your indices one to the right:

Path cityInputPath = new Path(args[1]);
Path countryInputPath = new Path(args[2]);
Path outputPath = new Path(args[3]);
MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);

Don't forget to clear your output folder though!

Also, a small tip for you: you can separate the input files with a comma, so you can set them with a single call like this:

hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat,/user/cloudera/capital/input/Country.dat

And in your Java code:

FileInputFormat.addInputPaths(job, args[1]);

Note that this only works when a single mapper class handles both files; with MultipleInputs and per-file mappers, as in your driver, you still need the two separate addInputPath calls.

3 Comments

That is strange, because I have always started my programs with this command and it never treated org.myorg.Class as the zeroth argument. Shifting all my indices strangely leads to the same error. Also, my output folder does not exist. The problem is that it thinks /user/cloudera/input/Country.dat is my output folder; that's why it's not empty. The question is why it thinks that this is my output folder.
If it leads to the exact same error, you are not running the code you have provided.
In my experience, @gaussd is right: org.myorg.Capital is not the 0th element in args. It just says "start with the class org.myorg.Capital in the capital.jar file".

What is happening here is that the class name is deemed to be the first argument!

By default, the first non-option argument is the name of the class to be invoked. A fully-qualified class name should be used. If the -jar option is specified, the first non-option argument is the name of a JAR archive containing class and resource files for the application, with the startup class indicated by the Main-Class manifest header.

So what I would suggest is that you add a manifest file to your jar in which you specify the main class. Your MANIFEST.MF file may look like this:

Manifest-Version: 1.0
Main-Class: org.myorg.Capital

And now your command would look like:

hadoop jar capital.jar /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output
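If you build the jar by hand, you can attach the manifest at packaging time (a sketch; the classes directory is an assumption about your build layout):

jar cfm capital.jar MANIFEST.MF -C classes .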

You can certainly just change the index values used in your code, but that's not an advisable solution.



Can you try this:

hadoop jar capital.jar /user/cloudera/capital/input /user/cloudera/capital/output

This should read all files in the single input directory.
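The matching driver code might look like this sketch (assuming a single mapper class handles both City.dat and Country.dat; with your per-file JoinCityMapper/JoinCountryMapper setup you would still need MultipleInputs):

FileInputFormat.addInputPath(job, new Path(args[0]));    // the whole input directory
FileOutputFormat.setOutputPath(job, new Path(args[1]));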

