I have inherited some XML that I need to process in Python. I am using xml.etree.cElementTree, and I am having trouble associating the text that follows an empty element with that empty element's tag. The XML is quite a bit more complicated than what I have pasted below, but I have simplified it to make the problem clearer (I hope!).
The result I would like to have is a dict like this:
DESIRED RESULT
{(9, 1): 'As they say, A student has usually three maladies:', (9, 2): 'poverty, itch, and pride.'}
The tuples can also contain strings (e.g., ('9', '1')). I really don't care at this early stage.
Here is the XML:
test1.xml
<div1 type="chapter" num="9">
<p>
<section num="1"/> <!-- The empty element -->
As they say, A student has usually three maladies: <!-- Here lies the trouble -->
<section num="2"/> <!-- Another empty element -->
poverty, itch, and pride.
</p>
</div1>
WHAT I HAVE TRIED
Attempt 1
>>> import xml.etree.cElementTree as ET
>>> tree = ET.parse('test1.xml')
>>> root = tree.getroot()
>>> chapter = root.attrib['num']
>>> d = dict()
>>> for p in root:
...     for section in p:
...         d[(int(chapter), int(section.attrib['num']))] = section.text
>>> d
{(9, 2): None, (9, 1): None} # This of course makes sense, since the elements are empty
Attempt 2
>>> for p in root:
...     for section, text in zip(p, p.itertext()): # unfortunately, p and p.itertext() are two different lengths, which also makes sense
...         d[(int(chapter), int(section.attrib['num']))] = text.strip()
>>> d
{(9, 2): 'As they say, A student has usually three maladies:', (9, 1): ''}
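As you can see in the latter attempt, p and p.itertext() are two different lengths: p has two children, but p.itertext() yields three strings, the first being the whitespace that precedes the first section element. Every string is therefore paired with the wrong key. The value of (9, 2) is the value I am trying to associate with key (9, 1), and the value I want to associate with (9, 2) does not even show up in d (since zip truncates the longer p.itertext()).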
Any help would be appreciated. Thanks in advance.
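One possible direction, offered as a minimal sketch rather than a definitive solution: in ElementTree, the text that follows an element's end tag (up to the next tag) is stored on that element's tail attribute, so the text after each empty section element can be read from section.tail. The sketch below assumes the simplified XML from the question, embedded as a string for self-containment. The helper name sections_by_tail is made up for illustration, and xml.etree.ElementTree is used in place of cElementTree, which was removed in Python 3.9.

```python
import xml.etree.ElementTree as ET

# The simplified XML from the question, inlined so the sketch is self-contained.
XML = """<div1 type="chapter" num="9">
<p>
<section num="1"/> <!-- The empty element -->
As they say, A student has usually three maladies: <!-- Here lies the trouble -->
<section num="2"/> <!-- Another empty element -->
poverty, itch, and pride.
</p>
</div1>"""


def sections_by_tail(root):
    """Map (chapter, section) to the text that follows each empty <section/>.

    Hypothetical helper: assumes every section's text is plain character
    data that sits directly after the element, i.e. in its .tail.
    """
    chapter = int(root.attrib['num'])
    result = {}
    for section in root.iter('section'):
        # .tail is None when nothing follows the element, hence the fallback.
        text = (section.tail or '').strip()
        result[(chapter, int(section.attrib['num']))] = text
    return result


root = ET.fromstring(XML)
d = sections_by_tail(root)
print(d)
```

The default parser discards comments, so they do not end up in the tail. One caveat for the real, more complicated XML: if the text after a section contains other child elements (inline markup, for example), tail only covers the text up to the next element, and the remaining pieces would have to be collected from the following siblings.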