parsing large xml file with Python - etree.parse error

Question

I'm trying to parse the following XML file using the lxml.etree.iterparse function.

"sampleoutput.xml"

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>

I tried the code from "Parsing Large XML file with Python lxml and Iterparse".

Before the etree.iterparse(MYFILE) call, I set MYFILE = open("/Users/eric/Desktop/wikipedia_map/sampleoutput.xml","r")

But it produces the following error:

Traceback (most recent call last):
  File "/Users/eric/Documents/Programming/Eclipse_Workspace/wikipedia_mapper/testscraper.py", line 6, in <module>
    for event, elem in context :
  File "iterparse.pxi", line 491, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:98565)
  File "iterparse.pxi", line 543, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:99086)
  File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 5, column 1

Any ideas? Thank you!

Could it be that your XML file is malformed? It contains no <?xml tag or a root element.
– C0deH4cker, Commented Jul 9, 2012 at 4:33
I don't know lxml, but your example isn't valid XML. An XML document has to have a single root element. Yours doesn't.
– Peter Graham, Commented Jul 9, 2012 at 4:35
Yes, I edited the accepted answer with some solutions. There are two: 1. pretend it's html (using the html flag in the parser), 2. wrap the file object with something that adds root elements.
– Emiel, Commented Aug 31, 2015 at 9:47

sblom · Accepted Answer · 2012-07-09 05:01:29Z

13

The problem is that an XML document isn't well-formed unless it has exactly one top-level element. You can fix your sample by wrapping the entire document in <items></items> tags. You also need to rename the <desc> tags to <description> so that they match the query that you're using (description).

The following document produces correct results with your existing code:

<items>
  <item>
    <title>Item 1</title>
    <description>Description 1</description>
  </item>
  <item>
    <title>Item 2</title>
    <description>Description 2</description>
  </item>
</items>
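
For reference, a loop over the corrected document might look like the sketch below. It uses the standard library's xml.etree.ElementTree.iterparse rather than lxml so that it runs without extra packages; lxml.etree.iterparse accepts the same source and events arguments. The read_items helper and the inline DOCUMENT bytes are illustrative stand-ins, not the asker's actual code.

```python
import io
import xml.etree.ElementTree as ET

# The corrected document from above, inlined so the sketch is self-contained.
DOCUMENT = b"""<items>
  <item>
    <title>Item 1</title>
    <description>Description 1</description>
  </item>
  <item>
    <title>Item 2</title>
    <description>Description 2</description>
  </item>
</items>
"""


def read_items(source):
    """Return (title, description) pairs, one per <item> element."""
    rows = []
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            rows.append((elem.findtext("title"), elem.findtext("description")))
            elem.clear()  # drop the children we no longer need
    return rows


rows = read_items(io.BytesIO(DOCUMENT))
```

With a real file, a path or a file object opened in binary mode can be passed in place of the io.BytesIO wrapper.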

answered Jul 9, 2012 at 5:01


1 Comment

Hady Elsahar Over a year ago

What if the file is so large that I don't want to load it into memory, which is why I'm parsing it with iterparse?
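
One way to handle a large root-less file without editing it on disk is to supply the missing root element while streaming, as suggested in the comments on the question. The sketch below is only one possible approach: it uses the standard library's xml.etree.ElementTree.XMLPullParser instead of lxml, and the names iter_items, root_tag, chunk_size and SAMPLE are made up for illustration. It assumes the file is opened in text mode and does not start with an <?xml ...?> declaration.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for the root-less file from the question (hypothetical constant).
SAMPLE = (
    "<item>\n  <title>Item 1</title>\n  <desc>Description 1</desc>\n</item>\n"
    "<item>\n  <title>Item 2</title>\n  <desc>Description 2</desc>\n</item>\n"
)


def iter_items(fileobj, tag="item", root_tag="items", chunk_size=64 * 1024):
    """Yield each <tag> element from a text stream that lacks a root element."""
    parser = ET.XMLPullParser(events=("start", "end"))
    root = None

    def drain():
        nonlocal root
        for event, elem in parser.read_events():
            if event == "start":
                if root is None:
                    root = elem  # the first start event is the synthetic root
            elif elem.tag == tag:
                yield elem
                root.clear()  # detach finished items so memory stays bounded

    parser.feed(f"<{root_tag}>")  # synthetic opening tag
    yield from drain()
    while True:
        chunk = fileobj.read(chunk_size)  # only one chunk is read at a time
        if not chunk:
            break
        parser.feed(chunk)
        yield from drain()
    parser.feed(f"</{root_tag}>")  # synthetic closing tag
    yield from drain()
    parser.close()


pairs = [
    (item.findtext("title"), item.findtext("desc"))
    for item in iter_items(io.StringIO(SAMPLE))
]
```

Each element should be consumed inside the loop, because it is detached from the synthetic root as soon as the generator resumes. For a real file, open(path, "r", encoding="utf-8") would take the place of the io.StringIO object.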

Michael Schlottke-Lakemper · Accepted Answer · 2012-07-09 04:39:49Z

5

As far as I know, xml.etree.ElementTree usually expects the XML file to contain one "root" element, i.e. one XML tag that encloses the complete document structure. From the error message you posted I would assume that this is the problem here as well:

"Line 5" refers to the second <item> tag, so I guess the parser complains that there is more data after the assumed root element (i.e. the first <item> element) was closed.
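
The reasoning above can be checked with a small experiment. This is a minimal sketch using the standard library's xml.etree.ElementTree (not lxml, so the exception type and message differ from the traceback in the question); FRAGMENT and first_error_position are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

# The sample from the question: two sibling <item> elements and no root.
FRAGMENT = """<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>
"""


def first_error_position(text):
    """Return the (line, column) of the first parse error, or None if none."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return err.position
    return None


# Parsing the fragment as-is fails where the second <item> begins.
error_position = first_error_position(FRAGMENT)

# Wrapping the same text in a single root element removes the error.
wrapped_position = first_error_position(f"<items>{FRAGMENT}</items>")
```

The line number in error_position points at the second <item>, and the wrapped version parses cleanly, which matches the explanation that the parser treats the first <item> as the root and rejects whatever follows it.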

answered Jul 9, 2012 at 4:39

