
I am trying to process a file that has the following structure:

    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">
    <doc msize="000007622" md5="235d6d9aa0071dd0bd711e812ff918fc" sysId="sbknwsarchp01" destination="AW" distId="    " transmission-date="                " >
    <djnml publisher="DJN" docdate="20160301" product="DN" seq="4" xml:lang="en-us" >
    <head>
    <copyright year="2016" holder="text" ></copyright>
    <docdata>
    <djn>
    <djn-newswires news-source="DJDN" origin="DJ" service-id="CO" >
    <djn-press-cutout/>
    <djn-urgency>0</djn-urgency>
    <djn-mdata brand="DJ" temp-perm="P" retention="N" hot="N" original-source="DJCS" accession-number="20160301000004" page-citation="" display-date="20160301T050006.315Z" >
    <djn-coding>
    <djn-government>
    <c>G/AGD</c>
    <c>G/USG</c>
    </djn-government>
    <djn-page>
    <c>70180</c>
    <c>83567</c>
    </djn-page>
    <djn-subject>
    <c>N/DJAG</c>
    <c>N/DJCS</c>
    </djn-subject>
    <djn-market>
    <c>M/MMR</c>
    </djn-market>
    <djn-product>
    <c>P/ACMD</c>
    <c>P/FNVW</c>
    </djn-product>
    <djn-geo>
    <c>R/NME</c>
    <c>R/TN</c>
    </djn-geo>
    </djn-coding>
    </djn-mdata>
    </djn-newswires>
    </djn>
    </docdata>
    </head>
    <body>
    <headline brand-display="DJ" >
    text</headline>
    <text>
    <pre>
    text
     </pre>
    <p>
      text</p>
    <p>
      text</p>
    </text>
    </body>
    </djnml>
    </doc>
    <?xml version="1.0" encoding="iso-8859-1" ?>
    <!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">
    <doc msize="000002698" md5="81b0dd0339b8c77febf46ebdaf8ef617" sysId="sbknwsarchp01" destination="AW" distId="    " transmission-date="                " >
    <djnml publisher="DJN" docdate="20160301" product="DN" seq="70" xml:lang="en-us" >
    <head>
    <copyright year="2016" holder="text" ></copyright>
    <docdata>
    <djn>
    <djn-newswires news-source="DJDN" origin="DJ" service-id="CO" >
    <djn-press-cutout/>
    <djn-urgency>0</djn-urgency>
    <djn-mdata brand="DJ" temp-perm="P" retention="N" hot="N" original-source="FW" accession-number="20160301000070" page-citation="" display-date="20160301T052632.174Z" >
    <djn-coding>
    <djn-company>
    <c>ANZ.AU</c>
    <c>ANZ.NZ</c>
    <c>ANZBY</c>
    </djn-company>
    <djn-isin>
    <c>AU000000ANZ3</c>
    <c>US0525283042</c>
    </djn-isin>
    <djn-industry>
    <c>I/BAN</c>
    <c>I/BKS</c>
    </djn-industry>
    <djn-page>
    <c>22767</c>
    <c>5014</c>
    <c>55115</c>
    </djn-page>
    <djn-subject>
    <c>N/AER</c>
    <c>N/BKG</c>
    </djn-subject>
    <djn-market>
    <c>M/FCL</c>
    <c>M/NND</c>
    </djn-market>
    <djn-product>
    <c>P/ABO</c>
    <c>P/AEI</c>
    </djn-product>
    <djn-geo>
    <c>R/ASA</c>
    <c>R/FE</c>
    </djn-geo>
    </djn-coding>
    </djn-mdata>
    </djn-newswires>
    </djn>
    </docdata>
    </head>
    <body>
    <headline brand-display="DJ" >
    text</headline>
    <text>
    <pre>
     </pre>
    <p>
         text </p>
    <pre>

    Editor JSM 

     </pre>
    <p>
      text</p>
    <p>
      text</p>
    </text>
    </body>
    </djnml>
    </doc>

That is, the file contains many smaller XML documents concatenated one after another.

I am trying the following code:

    import xml.etree.ElementTree as ET
    tree = ET.parse('test.nml')
    root = tree.getroot()
    print(root.iter('djn-subject'))
    for element_1 in root.iter('djn-subject'):
        for element_2 in root.iter('c'):
            print(element_2.text)

which gives the following error:

      File "<string>", line unknown
    ParseError: junk after document element: line 195, column 0

Any idea how I can get rid of this error? It seems my XML file has multiple roots. Is there a way to wrap everything in a single root element, or another way to deal with this issue? Thank you.

  • Did you try breaking the file into pieces based on the xml start tags? Should be pretty easy to just read in the first set, then go back and get the rest. Commented Apr 18, 2017 at 14:11
  • See a previous answer of mine for a function that uses ElementTree to split multiple XML docs out of a single file. Commented Apr 18, 2017 at 14:21
  • "It seems my XML file has multiple roots": by W3C standards, this markup is not an XML file. By definition, XML is well-formed, so conformant libraries like Python's etree should raise an error. Find the source of this markup, be it software, a vendor, or a programmer, and fix the glitch before continuing your development work. Commented Apr 18, 2017 at 17:36
  • Guys, can you give me a code example in the form of a proper answer? I bought the data as they are; this is the format they come in. Commented Apr 18, 2017 at 18:16
  • @adrCoder: What about complaining to your supplier? Processing a non-standard (because not well-formed) XML file is always ambitious, no matter which tool you use. Commented Apr 19, 2017 at 10:22

1 Answer


ElementTree expects a single root element. A file with several root elements is not well-formed XML, so the parser stops with an error like the one you see ("junk after document element"). You'll need to either edit the file so that every element you want to retrieve sits under one single root element, or split the file at each root into separate documents and parse them individually. Splitting is not the most efficient approach, and which option fits depends on whether the documents' namespaces and XSDs are the same or different.
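
The splitting option can be sketched as follows. This is a minimal sketch, not a definitive implementation: it assumes every embedded document starts with its own `<?xml ... ?>` declaration, as in the sample in the question, and that the file fits in memory. The helper names `split_documents` and `subject_codes` are made up for illustration. The data is handled as bytes so that each document's own `encoding` declaration is honoured by the parser. The inner loop iterates over `c` inside each `djn-subject` element rather than over the whole tree, which is probably what the question's nested loop intended.

```python
import re
import xml.etree.ElementTree as ET

# Start of an XML declaration; assumed to mark the beginning of each
# embedded document (true for the sample shown in the question).
_DECLARATION = re.compile(rb"<\?xml\s")


def split_documents(data):
    """Yield the raw bytes of each XML document concatenated in `data`."""
    starts = [match.start() for match in _DECLARATION.finditer(data)]
    # Pair every start offset with the next one (or the end of the data).
    for begin, end in zip(starts, starts[1:] + [len(data)]):
        yield data[begin:end]


def subject_codes(data):
    """Return one list of <djn-subject> codes per embedded document."""
    result = []
    for chunk in split_documents(data):
        # Each chunk has exactly one root element, so it parses cleanly.
        root = ET.fromstring(chunk)
        codes = [
            code.text
            for subject in root.iter("djn-subject")
            for code in subject.iter("c")
        ]
        result.append(codes)
    return result
```

To use it, read the file in binary mode, for example `data = open('test.nml', 'rb').read()`, and loop over `subject_codes(data)`. For very large files, a streaming variant that collects lines until the next declaration would avoid loading everything at once.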
