I have a directory in HDFS that contains roughly 10,000 .xml files. I have a Python script, processxml.py, that takes a file and does some processing on it. Is it possible to run the script on all of the files in the HDFS directory, or do I need to copy them to the local filesystem first in order to do so?
For example, when I run the script on files in a local directory I have:
cd /path/to/files
for file in *.xml
do
    python /path/processxml.py "$file" > "/path2/$file"
done
So basically, how would I go about doing the same thing when the files are in HDFS?
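To make the goal concrete, here is a rough Python sketch of the kind of thing I have in mind. It has not been tested against a real cluster, and it rests on several assumptions: the HDFS directory name /hdfs/path/to/files is a placeholder, the `hdfs` command is on the PATH, `hdfs dfs -ls` prints the usual eight-column listing, and processxml.py is able to read its input when given /dev/stdin as the file name. The idea is to list the directory, then stream each file through `hdfs dfs -cat` into the script instead of copying it to local disk first.

```python
import os
import subprocess

HDFS_DIR = "/hdfs/path/to/files"  # placeholder HDFS directory (assumption)
SCRIPT = "/path/processxml.py"    # script path from the local loop
OUT_DIR = "/path2"                # local output directory from the local loop


def parse_ls_output(text):
    """Extract the paths of .xml files from `hdfs dfs -ls` output."""
    paths = []
    for line in text.splitlines():
        # File entries start with '-'; this skips the "Found N items"
        # header and directory entries (which start with 'd').
        if not line.startswith("-"):
            continue
        # Assumed columns: permissions, replication, owner, group,
        # size, date, time, path. Splitting at most 7 times keeps a
        # path containing spaces in one piece.
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[7].endswith(".xml"):
            paths.append(fields[7])
    return paths


def process_all():
    """Stream every .xml file in HDFS_DIR through the script."""
    listing = subprocess.run(
        ["hdfs", "dfs", "-ls", HDFS_DIR],
        check=True, capture_output=True, text=True,
    ).stdout
    for path in parse_ls_output(listing):
        out_path = os.path.join(OUT_DIR, os.path.basename(path))
        with open(out_path, "wb") as out:
            # hdfs dfs -cat writes the file contents to stdout ...
            cat = subprocess.Popen(
                ["hdfs", "dfs", "-cat", path], stdout=subprocess.PIPE
            )
            # ... which the script reads via /dev/stdin (assumption).
            subprocess.run(
                ["python", SCRIPT, "/dev/stdin"],
                stdin=cat.stdout, stdout=out, check=True,
            )
            cat.stdout.close()
            cat.wait()
```

Calling process_all() would mirror the local loop, with output still landing in /path2 on the local machine. One concern with this approach is that it starts a separate `hdfs` process per file, which may be slow for 10,000 files, so I would be glad to hear of a better way.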