I'm trying to read bulk data from the US Patent and Trademark Office (USPTO). I have tried several of the XML files and get the same result each time:

import xml.etree.ElementTree as ET
import re
file = 'ipgb20210105.xml'
tree = ET.parse(file)

yields: "ParseError: junk after document element: line 862, column 0"

I have also tried the recommendation to wrap the content in a fake root node, but that doesn't work either:

with open(file) as f:
    xml = f.read()
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")

yields: "ParseError: not well-formed (invalid token): line 2, column 2"

Any help much appreciated!

  • ipgb20210105.xml is not one big well-formed XML document. It consists of thousands of small XML documents (each with its own XML declaration) squashed together. Commented Apr 15, 2021 at 17:36
  • Try the approach in the question "Python 3: Split concatenated XML files". Commented Apr 16, 2021 at 8:09
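
Following the first comment, one possible approach is to split the file at each XML declaration and parse the pieces one at a time. The sketch below is only an illustration: the helper name `iter_documents` is made up for this example, and it assumes every declaration starts at the beginning of a line (which matches the "column 0" in the first error message). It has not been run against the actual USPTO file.

```python
import xml.etree.ElementTree as ET

XML_DECL = "<?xml"  # assumed to appear at column 0 of each document's first line


def iter_documents(lines):
    """Yield one parsed root element per concatenated XML document.

    `lines` is any iterable of text lines, e.g. an open file object.
    (Hypothetical helper, not part of ElementTree.)
    """
    buffer = []
    for line in lines:
        # A new XML declaration marks the start of the next document,
        # so parse whatever has been collected so far.
        if line.startswith(XML_DECL) and buffer:
            yield ET.fromstring("".join(buffer))
            buffer = []
        buffer.append(line)
    # Parse the final document, skipping a trailing run of blank lines.
    if "".join(buffer).strip():
        yield ET.fromstring("".join(buffer))
```

It could be used as `with open(file, encoding="utf-8") as f:` followed by `for root in iter_documents(f): ...`, which keeps only one small document in memory at a time instead of reading the whole file. The encoding is an assumption; check the declaration in the file.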
