
I have a directory in HDFS that contains roughly 10,000 .xml files. I have a Python script, processxml.py, that takes a file and does some processing on it. Is it possible to run the script on all of the files in the HDFS directory, or do I need to copy them to the local file system first in order to do so?

For example, when I run the script on files in a local directory I do:

cd /path/to/files

for file in *.xml
do
    python /path/processxml.py $file > /path2/$file
done

So basically, how would I go about doing the same, but this time the files are in HDFS?

  • Can you modify the processxml.py file? You could use the Python hdfs package: hdfscli.readthedocs.org/en/latest/… which allows you to access the files without needing to store them on your disk as an intermediate step, but unless you can modify your XML processor it probably won't help you. Commented Jan 28, 2016 at 20:35
  • Yes actually, I can modify the .py file, I will read the documentation. Thanks @TomDalton Commented Jan 28, 2016 at 20:45
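
For reference, a minimal sketch of what the first comment suggests, assuming the hdfs (hdfscli) package and a WebHDFS endpoint; the host, port, user, and paths are placeholders, and process_xml stands in for the existing logic in processxml.py:

# Hypothetical adaptation of processxml.py using the hdfs (hdfscli) package,
# reading files straight from HDFS instead of copying them to local disk.
from hdfs import InsecureClient

def process_xml(content):
    # stand-in for the existing processing logic
    return content

# placeholder WebHDFS endpoint and user
client = InsecureClient('http://namenode:50070', user='hadoop')

for name in client.list('/path/to/files'):
    if not name.endswith('.xml'):
        continue
    with client.read('/path/to/files/' + name, encoding='utf-8') as reader:
        result = process_xml(reader.read())
    client.write('/path2/' + name, data=result, encoding='utf-8', overwrite=True)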

2 Answers


You basically have two options:

1) Use Hadoop Streaming to create a MapReduce job (here you will only need the map part). Run this command from the shell or inside a shell script:

hadoop jar <the location of the streaming jar> \
        -D mapred.job.name=<name for the job> \
        -input /hdfs/input/dir \
        -output /hdfs/output/dir \
        -file your_script.py \
        -mapper "python your_script.py" \
        -numReduceTasks 0
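
Note that Hadoop Streaming feeds input to the mapper on stdin and collects whatever it writes to stdout, so processxml.py would need a small wrapper along these lines (a minimal sketch; the process function is a stand-in for your existing XML logic):

#!/usr/bin/env python
# Hypothetical streaming-mapper wrapper: Hadoop Streaming pipes input
# records (lines of the input splits) to stdin and captures stdout.
import sys

def process(record):
    # stand-in for the existing logic in processxml.py
    return record

for line in sys.stdin:
    line = line.rstrip("\n")
    if line:
        print(process(line))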

2) Create a Pig script and ship your Python code with it. Here is a basic example of the script:

input_data = LOAD '/hdfs/input/dir';
DEFINE mycommand `python your_script.py` ship('/path/to/your/script.py');
updated_data = STREAM input_data THROUGH mycommand PARALLEL 20;
STORE updated_data INTO '/hdfs/output/dir';

4 Comments

So it is not possible to just access the files in HDFS one at a time and run the py code on them? Something similar to here? @Javier
Also, are you missing semi-colons in the PIG example? @Javier
Yeah, some semi-colons were missing. Corrected.
Currently it just dumps all of the data together into the output dir. Is there a way to group the output data into its own .xml file for each input .xml file? @Javier

If you need to process the data in your files or move/copy/remove them around the file system, then PySpark (Spark with a Python interface) would be one of the best options (for speed and memory).
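
A minimal sketch of that approach, assuming a standard PySpark setup; wholeTextFiles returns (path, content) pairs so each XML file stays intact, and process_xml stands in for the logic in processxml.py:

# Hypothetical PySpark job: read each XML file whole, apply the existing
# processing logic, and write the results back to HDFS.
from pyspark import SparkContext

def process_xml(content):
    # stand-in for the logic currently in processxml.py
    return content

sc = SparkContext(appName="processxml")

# wholeTextFiles yields one (path, file_content) pair per input file
files = sc.wholeTextFiles("hdfs:///hdfs/input/dir/*.xml")
results = files.mapValues(process_xml)
results.saveAsTextFile("hdfs:///hdfs/output/dir")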

