Help with XML parsing in Python

Question

I have an XML file that contains hundreds of documents. Each block looks like this:

<DOC>
<DOCNO> FR940104-2-00001 </DOCNO>
<PARENT> FR940104-2-00001 </PARENT>
<TEXT>

<!-- PJG FTAG 4703 -->

<!-- PJG STAG 4703 -->

<!-- PJG ITAG l=90 g=1 f=1 -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->

<!-- PJG ITAG l=90 g=1 f=1 -->
 / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices
<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=01 g=1 f=1 -->
Vol. 59, No. 2
<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=02 g=1 f=1 -->
Tuesday, January 4, 1994
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<!-- PJG /STAG -->

<!-- PJG /FTAG -->
</TEXT>
</DOC>

I want to load this XML file into a dictionary, Text, with the DOCNO as the key and the text inside the <TEXT> tags as the value. The text should not contain any of the comments. For example, Text['FR940104-2-00001'] must contain Federal Register / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices Vol. 59, No. 2 Tuesday, January 4, 1994. This is the code I wrote:

L = doc.getElementsByTagName("DOCNO")
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:            
            docno.append(node3.data);
        #print node2.data
L = doc.getElementsByTagName("TEXT")
i = 0
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:
            Text[docno[i]] = node3.data
    i = i+1

Surprisingly, with my code I get Text['FR940104-2-00001'] as u'\n'. Why is that, and how can I get what I want?

your question is not very clear

– t00ny, Commented Sep 25, 2010 at 23:36

@t00ny: improved my question.

– pecker, Commented Sep 25, 2010 at 23:42

unutbu · Accepted Answer · 2010-09-26 00:20:12Z

4

You could avoid looping through the doc twice by using xml.sax.handler:

import xml.sax.handler
import collections


class DocBuilder(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.state=''
        self.docno=''
        self.text=collections.defaultdict(list)
    def startElement(self, name, attrs):
        self.state=name
    def endElement(self, name):
        if name==u'TEXT':
            self.docno=''
    def characters(self,content):        
        content=content.strip()
        if content:
            if self.state==u'DOCNO':
                self.docno+=content
            elif self.state==u'TEXT':
                if content:
                    self.text[self.docno].append(content)


with open('test.xml') as f:
    data=f.read()            
builder = DocBuilder()
xml.sax.parseString(data, builder)
for key,value in builder.text.items():  # iteritems() exists only in Python 2
    print('{k}: {v}'.format(k=key,v=' '.join(value)))
# FR940104-2-00001: Federal Register / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices Vol. 59, No. 2 Tuesday, January 4, 1994


3 Comments

shahjapan Over a year ago

can we use lxml for SAX parser, or how lxml.sax differs from xml.sax ?

unutbu Over a year ago

@Tumbleweed: Yes, one could use lxml.sax.saxify instead. The syntax is almost exactly the same as for xml.sax, though you'd have to change startElement to startElementNS since lxml.sax supports namespace-aware processing only. See codespeak.net/lxml/sax.html

unutbu Over a year ago

@Tumbleweed: Another option would be to use lxml.etree.iterparse or lxml.etree.XMLParser with a custom target. See Liza Daly's excellent article ibm.com/developerworks/xml/library/x-hiperfparse/#ibm-pcon for an example of how to do fast iterative parsing without building an entire parse tree in memory.
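
The iterative approach mentioned in the comment above can be sketched with the standard library's xml.etree.ElementTree.iterparse, used here as a stand-in for lxml.etree.iterparse so that nothing needs installing. The <DOCS> wrapper element, the shortened sample data, and the load_texts name are assumptions made for illustration. By default the standard library parser drops comments, so the text of each TEXT element comes back already merged.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: a <DOCS> wrapper (an assumption) around several <DOC> blocks.
SAMPLE = b"""<DOCS>
<DOC>
<DOCNO> FR940104-2-00001 </DOCNO>
<TEXT>
<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->
 / Vol. 59, No. 2
</TEXT>
</DOC>
<DOC>
<DOCNO> FR940104-2-00002 </DOCNO>
<TEXT>
<!-- PJG ITAG l=01 g=1 f=1 -->
Notices
</TEXT>
</DOC>
</DOCS>"""


def load_texts(source):
    """Map each DOCNO to its whitespace-normalized TEXT, one DOC at a time."""
    texts = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "DOC":
            key = elem.findtext("DOCNO", default="").strip()
            # Comments are skipped by the default parser, so the text is one string.
            texts[key] = " ".join(elem.findtext("TEXT", default="").split())
            elem.clear()  # release the children of the finished DOC
    return texts


texts = load_texts(io.BytesIO(SAMPLE))
for key, value in texts.items():
    print('{k}: {v}'.format(k=key, v=value))
```

Calling elem.clear() after each DOC is what keeps memory use low on large files; the root element still holds the emptied DOC elements, which is usually acceptable.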

Robert Rossney · Accepted Answer · 2010-09-26 19:39:16Z

2

Similar to unutbu's answer, though I think simpler:

from lxml import etree
with open('test.xml') as f:
    doc=etree.parse(f)

result={}
for elm in doc.xpath("/DOC[DOCNO]"):
    key = elm.xpath("DOCNO")[0].text.strip()
    value = " ".join(t.strip() for t in elm.xpath("TEXT/text()") if t.strip())  # join with spaces so the stripped pieces do not run together
    result[key] = value

The XPath that finds the DOC element in this example needs to be changed to be appropriate for your real document - e.g. if there's a single top-level element that all the DOC elements are children of, you'd change it to /*/DOC. The predicate on that XPath skips any DOC element that doesn't have a DOCNO child, which would otherwise cause an exception when setting the key.
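
As a rough illustration of the predicate described above, here is a sketch that uses the standard library's ElementTree rather than lxml; its limited XPath subset also supports the [DOCNO] child predicate. The <DOCS> wrapper and the sample documents are made up for the example. Calling findall on the root element plays the role of /*/DOC[DOCNO].

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the middle DOC has no DOCNO child and should be skipped.
SAMPLE = """<DOCS>
<DOC><DOCNO> A-1 </DOCNO><TEXT>first<!-- c --> text</TEXT></DOC>
<DOC><TEXT>no DOCNO here</TEXT></DOC>
<DOC><DOCNO> A-2 </DOCNO><TEXT>second text</TEXT></DOC>
</DOCS>"""

root = ET.fromstring(SAMPLE)
result = {}
# "DOC[DOCNO]" selects only DOC children of the root that have a DOCNO child.
for doc in root.findall("DOC[DOCNO]"):
    key = doc.findtext("DOCNO").strip()
    result[key] = " ".join(doc.findtext("TEXT", default="").split())

print(result)
```

Without the predicate, findtext("DOCNO") would return None for the middle DOC and the .strip() call would raise an AttributeError.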


4 Comments

unutbu Over a year ago

Thanks for this. I think your version is not only simpler, it also (unlike my now deleted lxml-based answer) correctly handles adjacent DOCNOs with no TEXT in between.

snapshoe Over a year ago

+1 for lxml. Much better than python's xml support in the standard library.

Robert Rossney Over a year ago

@unutbu: it actually doesn't handle adjacent DOCNOs at all. It finds DOC elements that have at least one DOCNO child. For each, it looks in the first DOCNO element to find the key. If there are multiple DOCNOs, it ignores all but the first. Also, if there are multiple TEXT children, it concatenates their text nodes together.

unutbu Over a year ago

Suppose the xml had more than one pair of DOCNO and TEXT nodes. Do you see a way to modify your code to handle this case?

twasbrillig · Accepted Answer · 2014-11-14 05:57:33Z

1

Using lxml:

import lxml.etree as le
with open('test.xml') as f:
    doc=le.parse(f)

texts={}
for docno in doc.xpath('DOCNO'):
    docno_text=docno.text.strip()    
    text=' '.join([t.strip() 
          for t in  docno.xpath('following-sibling::TEXT[1]/text()')
          if t.strip()])
    texts[docno_text]=text  # use the stripped key computed above

print(texts)
# {'FR940104-2-00001': 'Federal Register / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices Vol. 59, No. 2 Tuesday, January 4, 1994'}

This version is a tad simpler than my first lxml solution. It handles multiple DOCNO and TEXT nodes. The DOCNO and TEXT nodes should alternate, but in any case each DOCNO is associated with the closest TEXT node that follows it.
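
To make the "closest TEXT node that follows it" rule concrete, here is a sketch of the same pairing written against the standard library's ElementTree, which has no following-sibling axis, so the siblings are scanned by hand. The sample data and the pair_docnos name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: A-2 has no TEXT of its own before the next DOCNO.
SAMPLE = """<DOC>
<DOCNO> A-1 </DOCNO>
<TEXT>one<!-- c --> alpha</TEXT>
<DOCNO> A-2 </DOCNO>
<DOCNO> A-3 </DOCNO>
<TEXT>three</TEXT>
</DOC>"""


def pair_docnos(doc):
    """Associate each DOCNO with the first TEXT sibling that comes after it."""
    children = list(doc)
    texts = {}
    for i, child in enumerate(children):
        if child.tag != "DOCNO":
            continue
        # Rough equivalent of the XPath following-sibling::TEXT[1]
        following = next((c for c in children[i + 1:] if c.tag == "TEXT"), None)
        body = following.text if following is not None and following.text else ""
        texts[child.text.strip()] = " ".join(body.split())
    return texts


texts = pair_docnos(ET.fromstring(SAMPLE))
print(texts)
```

Note that A-2 and A-3 end up sharing the same text, which is what the rule implies when the nodes do not alternate.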

edited Nov 14, 2014 at 5:57 by twasbrillig

answered Sep 26, 2010 at 2:07 by unutbu

Ignacio Vazquez-Abrams · Accepted Answer · 2010-09-25 23:49:51Z

0

Your line

Text[docno[i]] = node3.data

replaces the value of the mapping instead of appending the new one. Your <TEXT> node has both text and comment children, interleaved with each other.
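
A small sketch of what that means in practice, using xml.dom.minidom on a shortened, made-up version of the <TEXT> block. It lists the child node types, then contrasts keeping only the last text node (the effect of plain assignment in the loop) with joining all of them.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Shortened, hypothetical version of the TEXT block from the question.
SAMPLE = """<DOC><TEXT>
<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->
</TEXT></DOC>"""

doc = minidom.parseString(SAMPLE)
text_elem = doc.getElementsByTagName("TEXT")[0]

# Text and comment children alternate.
kinds = ["comment" if n.nodeType == Node.COMMENT_NODE else "text"
         for n in text_elem.childNodes]
pieces = [n.data for n in text_elem.childNodes if n.nodeType == Node.TEXT_NODE]

# Plain assignment in a loop keeps only the final piece, here a lone newline.
last_only = None
for piece in pieces:
    last_only = piece

# Joining every piece and normalizing whitespace keeps the real content.
joined = " ".join("".join(pieces).split())

print(kinds)
print(repr(last_only))
print(joined)
```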


claws · Accepted Answer · 2010-09-26 00:03:04Z

The DOM parser keeps each comment as its own node, so the text between the comments arrives as a series of separate text nodes. Your nodeType check already skips the comment nodes.

So you need to use:

Text[docno[i]] += node3.data, but before that the dictionary needs an empty string for every key. You can add Text[node3.data] = '' in your first block of code.

So, your code becomes:

L = doc.getElementsByTagName("DOCNO")
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:            
            docno.append(node3.data);
            Text[node3.data] = '';
        #print node2.data

L = doc.getElementsByTagName("TEXT")
i = 0
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:
            Text[docno[i]]+= node3.data
    i = i+1
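
For reference, a self-contained variant of the same idea is sketched below. It is not the code above verbatim: it walks each DOC element instead of keeping a separate index, and it also strips the surrounding whitespace from the key and the text. The shortened sample data and the text_of helper are assumptions for illustration.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Shortened, hypothetical version of the document from the question.
SAMPLE = """<DOC>
<DOCNO> FR940104-2-00001 </DOCNO>
<TEXT>
<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->
<!-- PJG ITAG l=90 g=1 f=1 -->
 / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices
<!-- PJG /ITAG -->
</TEXT>
</DOC>"""


def text_of(element):
    """Concatenate the direct text children of element, skipping comments."""
    parts = [n.data for n in element.childNodes if n.nodeType == Node.TEXT_NODE]
    return " ".join("".join(parts).split())


doc = minidom.parseString(SAMPLE)
Text = {}
# Pair DOCNO and TEXT inside each DOC, so no shared counter is needed.
for doc_elem in doc.getElementsByTagName("DOC"):
    docno = text_of(doc_elem.getElementsByTagName("DOCNO")[0])
    Text[docno] = text_of(doc_elem.getElementsByTagName("TEXT")[0])

print(Text)
```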
