
I am trying to parse the following XML file using the lxml.etree.iterparse function.

"sampleoutput.xml"

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>

I tried the code from Parsing Large XML file with Python lxml and Iterparse

Before the etree.iterparse(MYFILE) call, I set MYFILE = open("/Users/eric/Desktop/wikipedia_map/sampleoutput.xml","r")

But it raises the following error:

Traceback (most recent call last):
  File "/Users/eric/Documents/Programming/Eclipse_Workspace/wikipedia_mapper/testscraper.py", line 6, in <module>
    for event, elem in context :
  File "iterparse.pxi", line 491, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:98565)
  File "iterparse.pxi", line 543, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:99086)
  File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 5, column 1

Any ideas? Thank you!

5 Comments

  • Could it be that your XML file is malformed? It contains no <?xml tag or a root element. Commented Jul 9, 2012 at 4:33
  • I don't know lxml, but your example isn't valid XML. An XML document has to have a single root element. Yours doesn't. Commented Jul 9, 2012 at 4:35
  • you need a root element, not only child nodes. Commented Jul 9, 2012 at 5:39
  • Is there a way to ignore not having a root element? Commented Jan 2, 2015 at 7:28
  • Yes, I edited the accepted answer with some solutions. There are two: 1. pretend it's html (using the html flag in the parser), 2. wrap the file object with something that adds root elements. Commented Aug 31, 2015 at 9:47

2 Answers


The problem is that an XML document isn't well-formed unless it has exactly one top-level element. You can fix your sample by wrapping the entire document in <items></items> tags. You also need to rename the <desc> tags to <description> so that they match the query you're using (description).

The following document produces correct results with your existing code:

<items>
  <item>
    <title>Item 1</title>
    <description>Description 1</description>
  </item>
  <item>
    <title>Item 2</title>
    <description>Description 2</description>
  </item>
</items>
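
As a rough illustration, the sketch below runs iterparse over the corrected document. It is not the code from the linked question: the helper name read_items and the in-memory BytesIO source are assumptions made for the example, and it falls back to the standard library's xml.etree.ElementTree (which accepts the same iterparse call) when lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Stdlib fallback; iterparse(source, events=...) works the same way here.
    import xml.etree.ElementTree as etree

XML = b"""<items>
  <item>
    <title>Item 1</title>
    <description>Description 1</description>
  </item>
  <item>
    <title>Item 2</title>
    <description>Description 2</description>
  </item>
</items>"""


def read_items(source):
    """Yield (title, description) for every <item> (hypothetical helper)."""
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield elem.findtext("title"), elem.findtext("description")
            # Drop the element's children so memory use stays low on big files.
            elem.clear()


items = list(read_items(io.BytesIO(XML)))
print(items)
```

For a file on disk, passing the path or a file opened in binary mode ("rb") in place of the BytesIO object should behave the same way.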
1 Comment

What if the file is so large that I don't want to load it into memory, which is why I am parsing it with iterparse?
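
One possible way to handle a large rootless file without loading it into memory is to wrap the file object so that the parser sees a synthetic root element around the stream. The sketch below is only an illustration of that idea: the class name RootWrapped, the tag name items, and the chunk size are all assumptions, and it presumes the file has no <?xml ...?> declaration (a declaration after the injected root tag would itself be an error).

```python
import io

try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback


class RootWrapped:
    """File-like wrapper that brackets a binary stream with a root tag.

    Hypothetical helper: only read() is provided, which is what
    iterparse uses to pull data from a file-like source.
    """

    def __init__(self, fileobj, tag=b"items", chunk_size=64 * 1024):
        self._chunks = self._generate(fileobj, tag, chunk_size)
        self._buffer = b""

    @staticmethod
    def _generate(fileobj, tag, chunk_size):
        yield b"<" + tag + b">"           # synthetic opening tag
        while True:
            chunk = fileobj.read(chunk_size)
            if not chunk:
                break
            yield chunk                    # original content, chunk by chunk
        yield b"</" + tag + b">"          # synthetic closing tag

    def read(self, size=-1):
        if size is None or size < 0:
            data = self._buffer + b"".join(self._chunks)
            self._buffer = b""
            return data
        # Refill the buffer until it can satisfy the request or input ends.
        while len(self._buffer) < size:
            chunk = next(self._chunks, None)
            if chunk is None:
                break
            self._buffer += chunk
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


ROOTLESS = b"""<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>
"""

titles = []
source = RootWrapped(io.BytesIO(ROOTLESS))  # a file opened with "rb" also works
for event, elem in etree.iterparse(source, events=("end",)):
    if elem.tag == "item":
        titles.append(elem.findtext("title"))
        elem.clear()  # release the processed element's children
print(titles)
```

Only one chunk plus the requested read size is buffered at a time, so the whole file is never held in memory.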

As far as I know, xml.etree.ElementTree usually expects the XML file to contain one "root" element, i.e. one XML tag that encloses the complete document structure. From the error message you posted I would assume that this is the problem here as well:

"Line 5" refers to the second <item> tag, so I guess Python complains that there is more data after the assumed root element (i.e. the first <item> element) was closed.
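
A small sketch that reproduces this with the standard library's xml.etree.ElementTree, using the sample from the question. The wording of the error message differs from lxml's, and the variable names are made up for the example, but the reported line should point at the second <item> in the same way.

```python
import xml.etree.ElementTree as ET

# The rootless sample from the question, starting on line 1.
ROOTLESS = """<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>
"""

line = None
try:
    ET.fromstring(ROOTLESS)
except ET.ParseError as err:
    # err.position is a (line, column) tuple locating the problem.
    line, column = err.position
    print(f"parse failed at line {line}: {err}")
```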
