1

I have an XML file that contains hundreds of documents, each inside a <DOC> block. Each block looks like this:

<DOC>
<DOCNO> FR940104-2-00001 </DOCNO>
<PARENT> FR940104-2-00001 </PARENT>
<TEXT>

<!-- PJG FTAG 4703 -->

<!-- PJG STAG 4703 -->

<!-- PJG ITAG l=90 g=1 f=1 -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->

<!-- PJG ITAG l=90 g=1 f=1 -->
 / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices
<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=01 g=1 f=1 -->
Vol. 59, No. 2
<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=02 g=1 f=1 -->
Tuesday, January 4, 1994
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<!-- PJG /STAG -->

<!-- PJG /FTAG -->
</TEXT>
</DOC>

I want to load this XML file into a dictionary named Text, with each DOCNO as the key and the text inside the <TEXT> tags as the value. The text should not contain any of the comments. For example, Text['FR940104-2-00001'] must contain Federal Register / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices Vol. 59, No. 2 Tuesday, January 4, 1994. This is the code I wrote:

L = doc.getElementsByTagName("DOCNO")
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:            
            docno.append(node3.data);
        #print node2.data
L = doc.getElementsByTagName("TEXT")
i = 0
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:
            Text[docno[i]] = node3.data
    i = i+1

Surprisingly, with my code I get u'\n' for Text['FR940104-2-00001']. Why does this happen, and how can I get the result I want?

2
  • your question is not very clear Commented Sep 25, 2010 at 23:36
  • @t00ny: improved my question. Commented Sep 25, 2010 at 23:42

5 Answers

4

You could avoid looping through the doc twice by using xml.sax.handler:

import xml.sax.handler
import collections


class DocBuilder(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.state=''
        self.docno=''
        self.text=collections.defaultdict(list)
    def startElement(self, name, attrs):
        self.state=name
    def endElement(self, name):
        if name==u'TEXT':
            self.docno=''
    def characters(self,content):        
        content=content.strip()
        if content:
            if self.state==u'DOCNO':
                self.docno+=content
            elif self.state==u'TEXT':
                self.text[self.docno].append(content)


with open('test.xml') as f:
    data=f.read()            
builder = DocBuilder()
xml.sax.parseString(data, builder)
for key,value in builder.text.items():
    print('{k}: {v}'.format(k=key,v=' '.join(value)))
# FR940104-2-00001: Federal Register / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices Vol. 59, No. 2 Tuesday, January 4, 1994

3 Comments

Can we use lxml as the SAX parser? How does lxml.sax differ from xml.sax?
@Tumbleweed: Yes, one could use lxml.sax.saxify instead. The syntax is almost exactly the same as for xml.sax, though you'd have to change startElement to startElementNS since lxml.sax supports namespace-aware processing only. See codespeak.net/lxml/sax.html
@Tumbleweed: Another option would be to use lxml.etree.iterparse or lxml.etree.XMLParser with a custom target. See Liza Daly's excellent article ibm.com/developerworks/xml/library/x-hiperfparse/#ibm-pcon for an example of how to do fast iterative parsing without building an entire parse tree in memory.
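
For readers who want to try the iterative approach without installing lxml, here is a minimal sketch using the standard library's xml.etree.ElementTree.iterparse, which offers a similar interface. It is not taken from the linked article. The SAMPLE data, the DOCS wrapper element, and the load_texts helper are all made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample: two DOC elements inside a hypothetical <DOCS> root.
SAMPLE = b"""<DOCS>
<DOC>
<DOCNO> FR940104-2-00001 </DOCNO>
<TEXT>
<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->
</TEXT>
</DOC>
<DOC>
<DOCNO> FR940104-2-00002 </DOCNO>
<TEXT>
Second document
</TEXT>
</DOC>
</DOCS>"""


def load_texts(source):
    """Map each DOCNO to its TEXT content, handling one DOC at a time."""
    texts = {}
    # "end" events fire once an element and all of its children are parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "DOC":
            continue
        docno = elem.findtext("DOCNO", default="").strip()
        text_elem = elem.find("TEXT")
        raw = "".join(text_elem.itertext()) if text_elem is not None else ""
        # split()/join collapses the newlines left where the comments were
        texts[docno] = " ".join(raw.split())
        # Drop the children of the finished DOC so memory use stays low.
        elem.clear()
    return texts


texts = load_texts(io.BytesIO(SAMPLE))
```

Calling elem.clear() after each DOC is what keeps memory use low on a file with hundreds of documents; without it, the whole tree would still be built in memory.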
2

Similar to unutbu's answer, though I think simpler:

from lxml import etree
with open('test.xml') as f:
    doc=etree.parse(f)

result={}
for elm in doc.xpath("/DOC[DOCNO]"):
    key = elm.xpath("DOCNO")[0].text.strip()
    value = " ".join(t.strip() for t in elm.xpath("TEXT/text()") if t.strip())
    result[key] = value

The XPath that finds the DOC element in this example needs to be changed to be appropriate for your real document - e.g. if there's a single top-level element that all the DOC elements are children of, you'd change it to /*/DOC. The predicate on that XPath skips any DOC element that doesn't have a DOCNO child, which would otherwise cause an exception when setting the key.
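
As a rough illustration of that adjustment, the sketch below wraps two DOC elements in a made-up DOCS root. It uses the standard library's xml.etree.ElementTree rather than lxml, so the path is written relative to the root (DOC[DOCNO]) instead of /*/DOC. The sample data is invented, and the second DOC deliberately lacks a DOCNO to show the predicate skipping it.

```python
import xml.etree.ElementTree as ET

# Invented sample: a hypothetical <DOCS> root wrapping the DOC elements.
SAMPLE = """<DOCS>
<DOC>
<DOCNO> FR940104-2-00001 </DOCNO>
<TEXT>
<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->
 / Vol. 59, No. 2
</TEXT>
</DOC>
<DOC>
<TEXT>no DOCNO here, so this DOC is skipped</TEXT>
</DOC>
</DOCS>"""

root = ET.fromstring(SAMPLE)

result = {}
# "DOC[DOCNO]" selects DOC children of the root that have a DOCNO child.
for elm in root.findall("DOC[DOCNO]"):
    key = elm.findtext("DOCNO").strip()
    # itertext() yields only text; the comments never show up in it.
    pieces = "".join(t for text in elm.findall("TEXT") for t in text.itertext())
    # split()/join collapses the leftover newlines into single spaces
    result[key] = " ".join(pieces.split())
```

Only the first DOC ends up in result, because the second one has no DOCNO child and the predicate filters it out before the key lookup runs.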

4 Comments

Thanks for this. I think your version is not only simpler, it also (unlike my now deleted lxml-based answer) correctly handles adjacent DOCNOs with no TEXT in between.
+1 for lxml. Much better than python's xml support in the standard library.
@unutbu: it actually doesn't handle adjacent DOCNOs at all. It finds DOC elements that have at least one DOCNO child. For each, it looks in the first DOCNO element to find the key. If there are multiple DOCNOs, it ignores all but the first. Also, if there are multiple TEXT children, it concatenates their text nodes together.
Suppose the xml had more than one pair of DOCNO and TEXT nodes. Do you see a way to modify your code to handle this case?
1

Using lxml:

import lxml.etree as le
with open('test.xml') as f:
    doc=le.parse(f)

texts={}
for docno in doc.xpath('DOCNO'):
    docno_text=docno.text.strip()    
    text=' '.join([t.strip() 
          for t in  docno.xpath('following-sibling::TEXT[1]/text()')
          if t.strip()])
    texts[docno_text]=text

print(texts)
# {'FR940104-2-00001': 'Federal Register / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices Vol. 59, No. 2 Tuesday, January 4, 1994'}

This version is a tad simpler than my first lxml solution. It handles multiple DOCNO and TEXT nodes. The DOCNO and TEXT nodes should alternate, but in any case each DOCNO is associated with the closest TEXT node that follows it.


0

Your line

Text[docno[i]] = node3.data

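replaces the value stored under that key on every iteration instead of appending to it. Your <TEXT> node has both text and comment children, interleaved with each other, so only the last text child (the newline after the final comment) survives, which is why you get u'\n'.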
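
To see this concretely, here is a minimal sketch using xml.dom.minidom on a shortened copy of the sample. The SAMPLE string and the text_of helper are made up for illustration. It inspects the children of <TEXT> and then collects every text child instead of keeping only the last one.

```python
from xml.dom.minidom import parseString, Node

# Shortened, hypothetical copy of one <DOC> block from the question.
SAMPLE = """<DOC>
<DOCNO> FR940104-2-00001 </DOCNO>
<TEXT>
<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->
 / Vol. 59, No. 2
<!-- PJG /FTAG -->
</TEXT>
</DOC>"""


def text_of(element):
    """Join all direct text children of element, skipping comment nodes."""
    parts = [child.data for child in element.childNodes
             if child.nodeType == Node.TEXT_NODE]
    # split()/join collapses the newlines left around each comment
    return " ".join("".join(parts).split())


doc = parseString(SAMPLE)
text_node = doc.getElementsByTagName("TEXT")[0]

# The children alternate between text nodes and comment nodes.
kinds = [child.nodeType for child in text_node.childNodes]

# The last text child is just the newline before </TEXT>, which is
# what plain assignment in the loop leaves behind.
last_text = [c.data for c in text_node.childNodes
             if c.nodeType == Node.TEXT_NODE][-1]

docno = text_of(doc.getElementsByTagName("DOCNO")[0])
Text = {docno: text_of(text_node)}
```

Here last_text is the lone newline that the original loop ends up storing, while Text holds the joined content under the stripped DOCNO key.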


0

The DOM parser does not merge the comments into the text for you. Each comment becomes its own Comment node, and each run of text between comments is a separate Text node.

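So you need to append rather than assign:

Text[docno[i]] += node3.data

Before that, the dictionary must already hold an empty string for every key, so initialize each entry to '' in your first block of code. Stripping the DOCNO text there also removes the spaces around the key, so that Text['FR940104-2-00001'] can be looked up.

So your code becomes:

from xml.dom import Node

docno = []
Text = {}

# doc is the already parsed DOM document from the question
L = doc.getElementsByTagName("DOCNO")
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:
            key = node3.data.strip()
            docno.append(key)
            Text[key] = ''  # start empty so that += below works

L = doc.getElementsByTagName("TEXT")
i = 0
for node2 in L:
    for node3 in node2.childNodes:
        # comment children have a different nodeType and are skipped
        if node3.nodeType == Node.TEXT_NODE:
            Text[docno[i]] += node3.data
    i = i + 1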
