
I have a directory in HDFS that contains roughly 10,000 .xml files. I have a Python script, processxml.py, that takes a file and does some processing on it. Is it possible to run the script on all of the files in the HDFS directory, or do I need to copy them to the local file system first in order to do so?

For example, when I run the script on files in a local directory I do:

cd /path/to/files

for file in *.xml
do
    python /path/processxml.py $file > /path2/$file
done

So basically, how would I go about doing the same, but this time the files are in HDFS?

  • Can you modify the processxml.py file? You could use the Python hdfs package: hdfscli.readthedocs.org/en/latest/… which allows you to access the files without needing to store them on your disk as an intermediate step, but unless you can modify your XML processor it probably won't help you. Commented Jan 28, 2016 at 20:35
  • Yes actually, I can modify the .py file, I will read the documentation. Thanks @TomDalton Commented Jan 28, 2016 at 20:45
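
For reference, a minimal sketch of what the first comment suggests, assuming the hdfs (hdfscli) package and a WebHDFS endpoint; the host, port, user, and paths are placeholders, and process_xml stands in for the existing logic in processxml.py:

# Hypothetical adaptation of processxml.py using the hdfs (hdfscli) package,
# reading files straight from HDFS instead of copying them to local disk.
from hdfs import InsecureClient

def process_xml(content):
    # stand-in for the existing processing logic
    return content

# placeholder WebHDFS endpoint and user
client = InsecureClient('http://namenode:50070', user='hadoop')

for name in client.list('/path/to/files'):
    if not name.endswith('.xml'):
        continue
    with client.read('/path/to/files/' + name, encoding='utf-8') as reader:
        result = process_xml(reader.read())
    client.write('/path2/' + name, data=result, encoding='utf-8', overwrite=True)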

2 Answers


You basically have two options:

1) Use Hadoop Streaming to create a MapReduce job (here you will only need the map part). Run this command from the shell or inside a shell script:

hadoop jar <the location of the streaming jar> \
        -D mapred.job.name=<name for the job> \
        -input /hdfs/input/dir \
        -output /hdfs/output/dir \
        -file your_script.py \
        -mapper "python your_script.py" \
        -numReduceTasks 0
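
Note that Hadoop Streaming feeds input to the mapper on stdin and collects whatever it writes to stdout, so processxml.py would need a small wrapper along these lines (a minimal sketch; the process function is a stand-in for your existing XML logic):

#!/usr/bin/env python
# Hypothetical streaming-mapper wrapper: Hadoop Streaming pipes input
# records (lines of the input splits) to stdin and captures stdout.
import sys

def process(record):
    # stand-in for the existing logic in processxml.py
    return record

for line in sys.stdin:
    line = line.rstrip("\n")
    if line:
        print(process(line))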

2) Create a Pig script and ship your Python code with it. Here is a basic example of the script:

input_data = LOAD '/hdfs/input/dir';
DEFINE mycommand `python your_script.py` ship('/path/to/your/script.py');
updated_data = STREAM input_data THROUGH mycommand PARALLEL 20;
STORE updated_data INTO '/hdfs/output/dir';

4 Comments

So it is not possible to just access the files in HDFS one at a time and run the py code on them? Something similar to here? @Javier
Also, are you missing semi-colons in the PIG example? @Javier
Yeah, some semi-colons were missing. Corrected.
Currently it just dumps all of the data together into the output dir. Is there a way to group the output data into its own .xml file for each input .xml file? @Javier

If you need to process the data in your files or move/copy/remove them around the file system, then PySpark (Spark with a Python interface) would be one of the best options (for speed and memory).
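
A minimal sketch of that approach, assuming a standard PySpark setup; wholeTextFiles returns (path, content) pairs so each XML file stays intact, and process_xml stands in for the logic in processxml.py:

# Hypothetical PySpark job: read each XML file whole, apply the existing
# processing logic, and write the results back to HDFS.
from pyspark import SparkContext

def process_xml(content):
    # stand-in for the logic currently in processxml.py
    return content

sc = SparkContext(appName="processxml")

# wholeTextFiles yields one (path, file_content) pair per input file
files = sc.wholeTextFiles("hdfs:///hdfs/input/dir/*.xml")
results = files.mapValues(process_xml)
results.saveAsTextFile("hdfs:///hdfs/output/dir")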

