I have inherited some XML that I need to process in Python. I am using xml.etree.cElementTree, and I am having trouble associating text that occurs after an empty element with that empty element's tag. The XML is quite a bit more complicated than what I have pasted below, but I have simplified it to make the problem clearer (I hope!).

The result I would like to have is a dict like this:

DESIRED RESULT

{(9, 1): 'As they say, A student has usually three maladies:', (9, 2): 'poverty, itch, and pride.'}

The tuples can also contain strings (e.g., ('9', '1')). I really don't care at this early stage.

Here is the XML:

test1.xml

<div1 type="chapter" num="9">
  <p>
    <section num="1"/> <!-- The empty element -->
      As they say, A student has usually three maladies: <!-- Here lies the trouble -->
    <section num="2"/> <!-- Another empty element -->
      poverty, itch, and pride.
  </p>
</div1>

WHAT I HAVE TRIED

Attempt 1

>>> import xml.etree.cElementTree as ET
>>> tree = ET.parse('test1.xml')
>>> root = tree.getroot()
>>> chapter = root.attrib['num']
>>> d = dict()
>>> for p in root:
    for section in p:
        d[(int(chapter), int(section.attrib['num']))] = section.text


>>> d
{(9, 2): None, (9, 1): None}    # This of course makes sense, since the elements are empty

Attempt 2

>>> for p in root:
    for section, text in zip(p, p.itertext()):    # unfortunately, p and p.itertext() are two different lengths, which also makes sense
        d[(int(chapter), int(section.attrib['num']))] = text.strip()


>>> d
{(9, 2): 'As they say, A student has usually three maladies:', (9, 1): ''}

As you can see in the latter attempt, p and p.itertext() are two different lengths. The value of (9, 2) is the value I am trying to associate with key (9, 1), and the value I want to associate with (9, 2) does not even show up in d (since zip truncates the longer p.itertext()).
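
To make the mismatch concrete, here is a minimal sketch that lists what p.itertext() actually yields for the sample document. It assumes the standard-library parser's default behaviour of dropping comments; the names XML, sections, chunks and pairs are only illustrative. The first chunk is the whitespace between <p> and the first <section/>, and that extra chunk is what shifts every pairing by one.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, inlined so the sketch is self-contained.
XML = """<div1 type="chapter" num="9">
  <p>
    <section num="1"/> <!-- The empty element -->
      As they say, A student has usually three maladies: <!-- Here lies the trouble -->
    <section num="2"/> <!-- Another empty element -->
      poverty, itch, and pride.
  </p>
</div1>"""

root = ET.fromstring(XML)
p = root.find("p")

sections = list(p)            # the <section/> child elements
chunks = list(p.itertext())   # p.text first, then each child's tail

print(len(sections), len(chunks))
print(repr(chunks[0]))        # whitespace-only text that precedes the first <section/>

# zip() pairs section 1 with that whitespace and section 2 with section 1's text.
pairs = [(s.get("num"), t.strip()) for s, t in zip(sections, chunks)]
print(pairs)
```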

Any help would be appreciated. Thanks in advance.

2 Answers

Have you tried using .tail?

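# xml.etree.cElementTree was removed in Python 3.9; plain ElementTree
# picks up the C accelerator automatically.
import xml.etree.ElementTree as ET

txt = """<div1 type="chapter" num="9">
         <p>
           <section num="1"/> <!-- The empty element -->
             As they say, A student has usually three maladies: <!-- Here lies the trouble -->
           <section num="2"/> <!-- Another empty element -->
             poverty, itch, and pride.
         </p>
         </div1>"""
root = ET.fromstring(txt)
for p in root:
    for s in p:
        # .tail holds the text that follows this element, up to the next element
        print(s.attrib['num'], s.tail.strip())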
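
Building on .tail, the dict the question asks for could be assembled roughly as follows. This is only a sketch against the simplified sample: sections_by_number is a hypothetical helper name, and it assumes the chapter number lives on the root element, as it does in test1.xml.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, inlined so the sketch is self-contained.
XML = """<div1 type="chapter" num="9">
  <p>
    <section num="1"/> <!-- The empty element -->
      As they say, A student has usually three maladies: <!-- Here lies the trouble -->
    <section num="2"/> <!-- Another empty element -->
      poverty, itch, and pride.
  </p>
</div1>"""


def sections_by_number(root):
    """Hypothetical helper: map (chapter, section) to the text after each <section/>."""
    chapter = int(root.get("num"))  # assumption: the root carries the chapter number
    result = {}
    for section in root.iter("section"):
        # .tail is None when nothing follows the element, so fall back to "".
        tail = section.tail or ""
        result[(chapter, int(section.get("num")))] = tail.strip()
    return result


d = sections_by_number(ET.fromstring(XML))
print(d)
```

The `or ""` guard matters in real data: a <section/> that is immediately followed by a closing tag has a tail of None, and calling .strip() on it directly would raise AttributeError.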

1 Comment

Brilliant. Worked like a charm. Thanks.

I would use BeautifulSoup for this:

from bs4 import BeautifulSoup

html_doc = """<div1 type="chapter" num="9">
  <p>
    <section num="1"/>
      As they say, A student has usually three maladies:
    <section num="2"/>
      poverty, itch, and pride.
  </p>
</div1>"""

soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning

result = {}
for chapter in soup.find_all(type='chapter'):
    for section in chapter.find_all('section'):
        result[(chapter['num'], section['num'])] = section.next_sibling.strip()

import pprint
pprint.pprint(result)

This prints:

{(u'9', u'1'): u'As they say, A student has usually three maladies:',
 (u'9', u'2'): u'poverty, itch, and pride.'}
