Python - parsing a file which contains multiple xml parts

Question

I am trying to process a file that has the following structure:

    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">
    <doc msize="000007622" md5="235d6d9aa0071dd0bd711e812ff918fc" sysId="sbknwsarchp01" destination="AW" distId="    " transmission-date="                " >
    <djnml publisher="DJN" docdate="20160301" product="DN" seq="4" xml:lang="en-us" >
    <head>
    <copyright year="2016" holder="text" ></copyright>
    <docdata>
    <djn>
    <djn-newswires news-source="DJDN" origin="DJ" service-id="CO" >
    <djn-press-cutout/>
    <djn-urgency>0</djn-urgency>
    <djn-mdata brand="DJ" temp-perm="P" retention="N" hot="N" original-source="DJCS" accession-number="20160301000004" page-citation="" display-date="20160301T050006.315Z" >
    <djn-coding>
    <djn-government>
    <c>G/AGD</c>
    <c>G/USG</c>
    </djn-government>
    <djn-page>
    <c>70180</c>
    <c>83567</c>
    </djn-page>
    <djn-subject>
    <c>N/DJAG</c>
    <c>N/DJCS</c>
    </djn-subject>
    <djn-market>
    <c>M/MMR</c>
    </djn-market>
    <djn-product>
    <c>P/ACMD</c>
    <c>P/FNVW</c>
    </djn-product>
    <djn-geo>
    <c>R/NME</c>
    <c>R/TN</c>
    </djn-geo>
    </djn-coding>
    </djn-mdata>
    </djn-newswires>
    </djn>
    </docdata>
    </head>
    <body>
    <headline brand-display="DJ" >
    text</headline>
    <text>
    <pre>
    text
     </pre>
    <p>
      text</p>
    <p>
      text</p>
    </text>
    </body>
    </djnml>
    </doc>
    <?xml version="1.0" encoding="iso-8859-1" ?>
    <!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">
    <doc msize="000002698" md5="81b0dd0339b8c77febf46ebdaf8ef617" sysId="sbknwsarchp01" destination="AW" distId="    " transmission-date="                " >
    <djnml publisher="DJN" docdate="20160301" product="DN" seq="70" xml:lang="en-us" >
    <head>
    <copyright year="2016" holder="text" ></copyright>
    <docdata>
    <djn>
    <djn-newswires news-source="DJDN" origin="DJ" service-id="CO" >
    <djn-press-cutout/>
    <djn-urgency>0</djn-urgency>
    <djn-mdata brand="DJ" temp-perm="P" retention="N" hot="N" original-source="FW" accession-number="20160301000070" page-citation="" display-date="20160301T052632.174Z" >
    <djn-coding>
    <djn-company>
    <c>ANZ.AU</c>
    <c>ANZ.NZ</c>
    <c>ANZBY</c>
    </djn-company>
    <djn-isin>
    <c>AU000000ANZ3</c>
    <c>US0525283042</c>
    </djn-isin>
    <djn-industry>
    <c>I/BAN</c>
    <c>I/BKS</c>
    </djn-industry>
    <djn-page>
    <c>22767</c>
    <c>5014</c>
    <c>55115</c>
    </djn-page>
    <djn-subject>
    <c>N/AER</c>
    <c>N/BKG</c>
    </djn-subject>
    <djn-market>
    <c>M/FCL</c>
    <c>M/NND</c>
    </djn-market>
    <djn-product>
    <c>P/ABO</c>
    <c>P/AEI</c>
    </djn-product>
    <djn-geo>
    <c>R/ASA</c>
    <c>R/FE</c>
    </djn-geo>
    </djn-coding>
    </djn-mdata>
    </djn-newswires>
    </djn>
    </docdata>
    </head>
    <body>
    <headline brand-display="DJ" >
    text</headline>
    <text>
    <pre>
     </pre>
    <p>
         text </p>
    <pre>

    Editor JSM 

     </pre>
    <p>
      text</p>
    <p>
      text</p>
    </text>
    </body>
    </djnml>
    </doc>

That is, the file contains many smaller, complete XML documents concatenated one after another.

I am trying the following code:

    import xml.etree.ElementTree as ET
    tree = ET.parse('test.nml')
    root = tree.getroot()
    print(root.iter('djn-subject'))
    for element_1 in root.iter('djn-subject'):
        for element_2 in root.iter('c'):
            print(element_2.text)

which gives the following error:

      File "<string>", line unknown
    ParseError: junk after document element: line 195, column 0

Any idea how I can get rid of this error? It seems my XML file has multiple roots. Is there a way to wrap everything in a single root, or another way to deal with this issue? Thank you.

Did you try breaking the file into pieces based on the XML start tags? It should be pretty easy to read in the first set, then go back and get the rest.
– Chris, Commented Apr 18, 2017 at 14:11
See a previous answer of mine for a function that uses ElementTree to split multiple XML docs out of a single file.
– Robᵩ, Commented Apr 18, 2017 at 14:21
"It seems my XML file has multiple roots" ... By W3C standards, this markup is not an XML file. By definition, XML is well-formed, so conformant libraries like Python's etree should raise an error. Find the source of this markup, be it software, a vendor, or a programmer, and fix the glitch before continuing your development work.
– Parfait, Commented Apr 18, 2017 at 17:36
Guys, can you give me an example of code in the form of a proper answer? I bought the data as is. This is the format it comes in.
– adrCoder, Commented Apr 18, 2017 at 18:16
@adrCoder: What about complaining to your supplier? Processing a non-standard (not well-formed) XML file is always ambitious, no matter which tool you use.
– guidot, Commented Apr 19, 2017 at 10:22

Kevin Pasquarella · Accepted Answer · 2017-04-18 14:15:56Z

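ElementTree expects a document with exactly one root element. Your file contains several complete documents back to back, so the parser finishes the first doc element and reports everything after it as "junk after document element", because the file as a whole is not well-formed XML. You have two options. Either edit the file so that all the elements you want to retrieve sit under one single root node (the repeated XML declarations and DOCTYPE lines then have to go as well, since they are only allowed at the start of a document), or break the file apart at each root node and parse the pieces individually. Splitting is not the most efficient approach, and how well it works depends on whether the pieces share the same namespaces and XSDs.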
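Both options can be sketched in a few lines. This is only an illustration under stated assumptions, not a definitive implementation: it assumes every embedded document starts with its own "<?xml " declaration, that the DOCTYPE lines have no internal subset, and that the text really is ISO-8859-1 as declared. The names iter_documents, wrap_in_root, subject_codes and the synthetic docs root are hypothetical, and SAMPLE is a heavily trimmed stand-in for the data in the question.

```python
import re
import xml.etree.ElementTree as ET

XML_DECL = b"<?xml "  # assumption: this opens every embedded document

# Trimmed stand-in for the file in the question (two documents back to back).
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    b'<!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">\n'
    b'<doc><djn-subject><c>N/DJAG</c><c>N/DJCS</c></djn-subject></doc>\n'
    b'<?xml version="1.0" encoding="iso-8859-1" ?>\n'
    b'<!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">\n'
    b'<doc><djn-subject><c>N/AER</c><c>N/BKG</c></djn-subject></doc>\n'
)


def iter_documents(data):
    """Option 1: split on the XML declaration and parse each document alone."""
    for part in data.split(XML_DECL):
        if part.strip():  # skip the empty piece before the first declaration
            # Parsing bytes lets the parser honour the declared encoding.
            yield ET.fromstring(XML_DECL + part)


def wrap_in_root(data, encoding="iso-8859-1"):
    """Option 2: drop the prologs and parse everything under one synthetic root."""
    text = data.decode(encoding)
    # Remove each XML declaration and DOCTYPE line (assumes no internal subset).
    body = re.sub(r"<\?xml\s[^>]*\?>|<!DOCTYPE[^>]*>", "", text)
    return ET.fromstring("<docs>" + body + "</docs>")


def subject_codes(root):
    """Collect the text of every c element nested inside a djn-subject element."""
    return [c.text for subject in root.iter("djn-subject") for c in subject.iter("c")]


if __name__ == "__main__":
    for doc in iter_documents(SAMPLE):
        print(subject_codes(doc))
    print(subject_codes(wrap_in_root(SAMPLE)))
```

For the real file, read it in binary mode (open('test.nml', 'rb')) and pass the bytes to either function. Note that subject_codes iterates over each djn-subject element rather than over the root, so it returns only the codes inside that element. Splitting keeps the documents separate, which is usually what you want if each one is a distinct news item; wrapping gives a single tree to search across.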
