
I am trying to read files inside a directory in HDFS using Python. I used the code below, but I am getting an error.

Code:

from subprocess import Popen, PIPE
cat = Popen(["hadoop", "fs", "-cat", "/user/cloudera/CCMD"], stdout=PIPE)

Error:

cat: `/user/cloudera/CCMD': Is a directory
Traceback (most recent call last):
  File "hrkpat.py", line 6, in <module>
    tree = ET.parse(cat.stdout)
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 862, in parse
    tree.parse(source, parser)
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 587, in parse
    self._root = parser.close()
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 1254, in close
    self._parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: no element found: line 1, column 0

Update:

I have 10-15 XML files in my HDFS directory that I want to parse. I am able to parse the XML when only one file is present in the directory, but as soon as there are multiple files I am not able to parse them. For this use case I want to write Python code that parses one file from the directory and, once it is done, moves on to the next one.

2 Answers


You can use the wildcard character * to read all files in the directory:

hadoop fs -cat /user/cloudera/CCMD/*

Or read only the XML files:

hadoop fs -cat /user/cloudera/CCMD/*.xml
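
Called from Python with subprocess as in the question, a minimal sketch would look roughly like this (hadoop fs expands the glob itself, so no shell is needed); note that the stream is all matched files concatenated back to back:

from subprocess import Popen, PIPE

cat = Popen(["hadoop", "fs", "-cat", "/user/cloudera/CCMD/*.xml"], stdout=PIPE)
data = cat.stdout.read()   # contents of every matched file, concatenated
cat.wait()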

Comments

It works if the directory has one file, but when there are multiple XML files in the directory it throws an error at tree = ET.parse(cat.stdout): xml.parsers.expat.ExpatError: junk after document element: line 12266, column 19, followed by cat: Unable to write to output stream. cat: Unable to write to output stream.
Reading all files from HDFS with * is fine, but about hadoop fs -cat /user/cloudera/CCMD/* — will this command read the files one by one, or all at one time?
The output will be the content of all the files, so it reads them all at one time.
You should use a for loop in Python to read/process the file(s) one at a time (see the sketch just below), or MapReduce with MultipleOutputFormat as @franklinsijo suggested.
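
A minimal sketch of such a loop, written against the Python 2.6 shown in the traceback and assuming the files of interest all end in .xml:

import xml.etree.ElementTree as ET
from subprocess import Popen, PIPE

# List the directory, keeping only the XML file paths (last field of each -ls line).
ls = Popen(["hadoop", "fs", "-ls", "/user/cloudera/CCMD"], stdout=PIPE)
paths = [line.split()[-1] for line in ls.stdout if line.strip().endswith(".xml")]
ls.wait()

# Cat and parse each file separately, so every parse sees one well-formed document.
for path in paths:
    cat = Popen(["hadoop", "fs", "-cat", path], stdout=PIPE)
    tree = ET.parse(cat.stdout)
    cat.wait()
    print("%s %s" % (path, tree.getroot().tag))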

The exception is cat: `/user/cloudera/CCMD': Is a directory

You are trying to perform a file operation on a directory. Pass the path of a file to the command.

Use this command in the subprocess instead:

hadoop fs -cat /user/cloudera/CCMD/filename
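
In the question's subprocess code that would look roughly like this (filename is a placeholder, as above):

from subprocess import Popen, PIPE
import xml.etree.ElementTree as ET

cat = Popen(["hadoop", "fs", "-cat", "/user/cloudera/CCMD/filename"], stdout=PIPE)
tree = ET.parse(cat.stdout)   # one file, so the stream is a single XML document
cat.wait()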

Comments

The file name can be different, but the directory in which the file is kept is the same; that's why I want to give the directory name instead of a file name in the path.
There will be multiple files in my directory that I want to read, not a single file.
Then hadoop fs -cat is not your command. Either get a list of files in the HDFS directory and read them using the retrieved file names. But why? Why are you trying to read files from HDFS? If it is for next-stage processing on the read data, try writing a MapReduce job.
I have some 10-15 XMLs in HDFS that I want to parse using Python. For that I need to parse them sequentially, one by one, as I cannot merge the XML.
Why not download them to local with hadoop fs -get and process them there (a rough sketch follows)? If you want the output to be stored in HDFS, then MapReduce is the solution.
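
A rough sketch of that approach; the local /tmp/ccmd path is only an example:

import glob
import subprocess
import xml.etree.ElementTree as ET

# Copy the whole HDFS directory to a local path, then parse the local copies.
# Assumes /tmp/ccmd did not already exist, so the files land directly under it.
subprocess.call(["hadoop", "fs", "-get", "/user/cloudera/CCMD", "/tmp/ccmd"])

for path in glob.glob("/tmp/ccmd/*.xml"):
    tree = ET.parse(path)
    print("%s %s" % (path, tree.getroot().tag))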
