
I am trying to read files inside a directory in HDFS using Python. I used the code below, but I am getting an error.

Code:

from subprocess import Popen, PIPE
cat = Popen(["hadoop", "fs", "-cat", "/user/cloudera/CCMD"], stdout=PIPE)

Error:

cat: `/user/cloudera/CCMD': Is a directory
Traceback (most recent call last):
  File "hrkpat.py", line 6, in <module>
    tree = ET.parse(cat.stdout)
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 862, in parse
    tree.parse(source, parser)
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 587, in parse
    self._root = parser.close()
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 1254, in close
    self._parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: no element found: line 1, column 0

Update:

I have 10-15 XML files in my HDFS directory that I want to parse. I am able to parse the XML when only one file is present in the directory, but as soon as there are multiple files I am not able to parse them. For this use case I want to write Python code that parses one file from the directory and, once it is done, moves on to the next one.

2 Answers


You can use the wildcard character * to read all files in the directory:

hadoop fs -cat /user/cloudera/CCMD/*

Or read only the XML files:

hadoop fs -cat /user/cloudera/CCMD/*.xml
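
Called from Python with subprocess as in the question, a minimal sketch would look roughly like this (hadoop fs expands the glob itself, so no shell is needed); note that the stream is all matched files concatenated back to back:

from subprocess import Popen, PIPE

cat = Popen(["hadoop", "fs", "-cat", "/user/cloudera/CCMD/*.xml"], stdout=PIPE)
data = cat.stdout.read()   # contents of every matched file, concatenated
cat.wait()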

Comments

It works if the directory has one file, but when there are multiple XML files in the directory it throws an error at tree = ET.parse(cat.stdout): xml.parsers.expat.ExpatError: junk after document element: line 12266, column 19, followed by cat: Unable to write to output stream. cat: Unable to write to output stream.
Reading all files from HDFS with * is fine, but about hadoop fs -cat /user/cloudera/CCMD/* — will this command read the files one by one, or all at one time?
The output will be the content of all the files, so it reads them all at one time.
You should use a for loop in Python to read/process the file(s) one at a time (see the sketch just below), or MapReduce with MultipleOutputFormat as @franklinsijo suggested.
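
A minimal sketch of such a loop, written against the Python 2.6 shown in the traceback and assuming the files of interest all end in .xml:

import xml.etree.ElementTree as ET
from subprocess import Popen, PIPE

# List the directory, keeping only the XML file paths (last field of each -ls line).
ls = Popen(["hadoop", "fs", "-ls", "/user/cloudera/CCMD"], stdout=PIPE)
paths = [line.split()[-1] for line in ls.stdout if line.strip().endswith(".xml")]
ls.wait()

# Cat and parse each file separately, so every parse sees one well-formed document.
for path in paths:
    cat = Popen(["hadoop", "fs", "-cat", path], stdout=PIPE)
    tree = ET.parse(cat.stdout)
    cat.wait()
    print("%s %s" % (path, tree.getroot().tag))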

The exception is cat: `/user/cloudera/CCMD': Is a directory

You are trying to perform a file operation on a directory. Pass the path of a file to the command.

Use this command in the subprocess instead:

hadoop fs -cat /user/cloudera/CCMD/filename
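
In the question's subprocess code that would look roughly like this (filename is a placeholder, as above):

from subprocess import Popen, PIPE
import xml.etree.ElementTree as ET

cat = Popen(["hadoop", "fs", "-cat", "/user/cloudera/CCMD/filename"], stdout=PIPE)
tree = ET.parse(cat.stdout)   # one file, so the stream is a single XML document
cat.wait()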

Comments

The file name can be different, but the directory in which the file is kept is the same; that's why I want to give the directory name instead of a file name in the path.
There will be multiple files in my directory that I want to read, not a single file.
Then hadoop fs -cat is not your command. Either get a list of files in the HDFS directory and read them using the retrieved file names. But why? Why are you trying to read files from HDFS? If it is for next-stage processing on the read data, try writing a MapReduce job.
I have some 10-15 XMLs in HDFS that I want to parse using Python. For that I need to parse them sequentially, one by one, as I cannot merge the XML.
Why not download them to local with hadoop fs -get and process them there (a rough sketch follows)? If you want the output to be stored in HDFS, then MapReduce is the solution.
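
A rough sketch of that approach; the local /tmp/ccmd path is only an example:

import glob
import subprocess
import xml.etree.ElementTree as ET

# Copy the whole HDFS directory to a local path, then parse the local copies.
# Assumes /tmp/ccmd did not already exist, so the files land directly under it.
subprocess.call(["hadoop", "fs", "-get", "/user/cloudera/CCMD", "/tmp/ccmd"])

for path in glob.glob("/tmp/ccmd/*.xml"):
    tree = ET.parse(path)
    print("%s %s" % (path, tree.getroot().tag))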
