Parse xml from file using etree works when reading string, but not a file

Question

I am relatively new to Python and SO. I have an XML file from which I need to extract information. I've been struggling with this for several days, but I think I finally found something that will extract the information properly. Now I'm having trouble getting the right output. Here is my code:

from xml.etree import ElementTree as etree
# fromstring() returns the root element (<dataObject>) directly
node = etree.fromstring('<dataObject><identifier>5e1882d882ec530069d6d29e28944396</identifier><description>This is a paragraph about a shark.</description></dataObject>')
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)

The result that I get is "5e1882d882ec530069d6d29e28944396 This is a paragraph about a shark.", which is what I want.

However, what I really need is to be able to read from a file instead of a string. So I try this code:

from xml.etree import ElementTree as etree
# parse() returns an ElementTree; searches start at its root element
node = etree.parse('test3.xml')
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)

Now my result is "None None". I have a feeling I'm either not reading the file in correctly or something is wrong with the output. Here are the contents of test3.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dwct="http://rs.tdwg.org/dwc/terms/" xsi:schemaLocation="http://www.eol.org/transfer/content/0.3 http://services.eol.org/schema/content_0_3.xsd">
  <identifier>5e1882d822ec530069d6d29e28944369</identifier>
  <description>This is a paragraph about a shark.</description>

Martijn Pieters · Accepted Answer · 2013-03-12 16:00:24Z

Your XML file uses a default namespace. You need to qualify your searches with the correct namespace:

identifier = node.findtext('{http://www.eol.org/transfer/content/0.3}identifier')

for ElementTree to match the correct elements.
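To see the difference side by side, here is a minimal sketch. It assumes a trimmed stand-in for test3.xml (only the default namespace declaration is kept, and the closing `</response>` tag is assumed), fed through an in-memory file object instead of a real file. The name `EOL_NS` is just an illustrative constant.

```python
import io
from xml.etree import ElementTree as ET

# Trimmed stand-in for test3.xml; the closing tag is an assumption.
XML = (
    '<response xmlns="http://www.eol.org/transfer/content/0.3">'
    '<identifier>5e1882d822ec530069d6d29e28944369</identifier>'
    '<description>This is a paragraph about a shark.</description>'
    '</response>'
)

# Illustrative constant holding the default namespace URI from the file.
EOL_NS = 'http://www.eol.org/transfer/content/0.3'

# ET.parse() accepts a filename or a file-like object.
tree = ET.parse(io.StringIO(XML))

# A bare tag name does not match, because the real tag is namespace-qualified.
unqualified = tree.findtext('identifier')

# Qualifying the name with {namespace-uri} matches the element.
qualified = tree.findtext('{%s}identifier' % EOL_NS)
description = tree.findtext('{%s}description' % EOL_NS)

print(unqualified)
print(qualified, description)
```

The unqualified lookup returns None for the same reason the question's code printed "None None".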

You could also give the .find(), .findall() and .iterfind() methods an explicit namespace dictionary. This is not documented very well:

namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'} # add more as needed

root.findall('eol:identifier', namespaces=namespaces)

Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the eol: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.eol.org/transfer/content/0.3}identifier instead.
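A small sketch of that prefix lookup, again assuming a trimmed stand-in for test3.xml with an assumed closing tag. The second dictionary uses a made-up prefix `x` to show that the prefix itself is arbitrary; only the URI it maps to matters.

```python
from xml.etree import ElementTree as ET

# Trimmed stand-in for test3.xml; the closing tag is an assumption.
XML = (
    '<response xmlns="http://www.eol.org/transfer/content/0.3">'
    '<identifier>5e1882d822ec530069d6d29e28944369</identifier>'
    '<description>This is a paragraph about a shark.</description>'
    '</response>'
)

root = ET.fromstring(XML)

namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'}
# Same URI under a different, arbitrary prefix.
other_prefix = {'x': 'http://www.eol.org/transfer/content/0.3'}

matches = root.findall('eol:identifier', namespaces=namespaces)
same_matches = root.findall('x:identifier', namespaces=other_prefix)

ids = [el.text for el in matches]
# The stored tag is always the fully qualified {uri}name form.
tags = [el.tag for el in matches]

# findtext() takes the same namespaces argument.
description = root.findtext('eol:description', namespaces=namespaces)

print(ids, tags, description)
```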

If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
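If lxml is not available, the standard library can gather the declared namespaces with the 'start-ns' events of iterparse(). The following is only a rough stand-in for the idea, not lxml's .nsmap: `collect_namespaces` is a hypothetical helper, the `default` prefix is an arbitrary name chosen for the unprefixed namespace, and the XML is a trimmed sample with an assumed closing tag.

```python
import io
from xml.etree import ElementTree as ET

# Trimmed stand-in for test3.xml; the closing tag is an assumption.
XML = (
    b'<response xmlns="http://www.eol.org/transfer/content/0.3" '
    b'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<identifier>5e1882d822ec530069d6d29e28944369</identifier>'
    b'</response>'
)


def collect_namespaces(source):
    """Hypothetical helper: map each declared prefix to its namespace URI."""
    found = {}
    for _event, (prefix, uri) in ET.iterparse(source, events=['start-ns']):
        # The default namespace arrives with an empty prefix; give it an
        # arbitrary name so it can be used in search expressions.
        found[prefix or 'default'] = uri
    return found


namespaces = collect_namespaces(io.BytesIO(XML))
root = ET.parse(io.BytesIO(XML)).getroot()
identifier = root.findtext('default:identifier', namespaces=namespaces)

print(namespaces)
print(identifier)
```

This reads the document twice, which is fine for small files; for large ones it may be simpler to write the namespace dictionary by hand.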

amadain · Answer · 2013-03-12 15:21:12Z

Have you thought of trying BeautifulSoup to parse your XML with Python?

http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing%20XML

There is some good documentation and a healthy online group, so support is quite good.

1 Comment

user2161557 Over a year ago

I have actually not thought of that. I will try it.
