1

Attempting to add a root tag to the beginning and end of a 2mil line XML file so the file can be properly processed with my Python code.

I tried using this code from a previous post, but I am getting an error "XMLSyntaxError: Extra content at the end of the document, line __, column 1"

How do I solve this? Or is there a better way to add a root tag to the beginning and end of my large XML doc?

import lxml.etree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
newroot = ET.Element("root")
newroot.insert(0, root)
print(ET.tostring(newroot, pretty_print=True))

My test XML

<pub>
    <ID>75</ID>
    <title>Use of Lexicon Density in Evaluating Word Recognizers</title>
    <year>2000</year>
    <booktitle>Multiple Classifier Systems</booktitle>
    <pages>310-319</pages>
    <authors>
        <author>Petr Slav&iacute;k</author>
        <author>Venu Govindaraju</author>
    </authors>
</pub>
<pub>
    <ID>120</ID>
    <title>Virtual endoscopy with force feedback - a new system for neurosurgical training</title>
    <year>2003</year>
    <booktitle>CARS</booktitle>
    <pages>782-787</pages>
    <authors>
        <author>Christos Trantakis</author>
        <author>Friedrich Bootz</author>
        <author>Gero Strau&szlig;</author>
        <author>Edgar Nowatius</author>
        <author>Dirk Lindner</author>
        <author>H&uuml;seyin Kem&acirc;l &Ccedil;akmak</author>
        <author>Heiko Maa&szlig;</author>
        <author>Uwe G. K&uuml;hnapfel</author>
        <author>J&uuml;rgen Meixensberger</author>
    </authors>
</pub>
3
  • 1
    Your test.xml document has no root element so it is not really XML and cannot be parsed as such. Commented Apr 24, 2017 at 21:30
  • @mzjn you missed the point, I was attempting to add the root tag so it could be read as XML. Commented Apr 25, 2017 at 3:03
  • Well, my point was that you tried to parse test.xml as XML before adding the root element. That's why you got the error. Commented Apr 25, 2017 at 6:45

1 Answer 1

1

I suspect that that gambit works because there is only one A element at the highest level. Fortunately, even with two million lines it's easy to add the lines you need.

In doing this I noticed that the lxml parser seems unable to process the accented characters. I have there added code to anglicise them.

import re

def anglicise(matchobj): return matchobj.group(0)[1]

outputFilename = 'result.xml'

with open('test.xml') as inXML, open(outputFilename, 'w') as outXML:
    outXML.write('<root>\n')
    for line in inXML.readlines():
        outXML.write(re.sub('&[a-zA-Z]+;',anglicise,line))
    outXML.write('</root>\n')

from lxml import etree

tree = etree.parse(outputFilename)
years = tree.xpath('.//year')
print (years[0].text)

Edit: Replace anglicise to this version to avoid replacing &amp;.

def anglicise(matchobj): 
    if matchobj.group(0) == '&amp;':
        return matchobj.group(0)
    else:
        return matchobj.group(0)[1]
Sign up to request clarification or add additional context in comments.

3 Comments

Awesome! I am getting a file output with everything, however, I have some entries in the bigger XML file with an &amp; for '&' , and the code is converting them, but to an 'a' character. I don't quite understand the outXML.write(re.sub('&[a-z]+;',anglicise,line)) part of your code, how do I adjust that to process the &?
Sorry one last thing, my code is now getting stuck up on the second, but not the first author with the &szlig; character in their name. I attached the xml node in my (edited) question. Seems odd to me that it wouldn't stop at the first, but the second throw an error
I've changed the regex.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.