The standard implementation of ElementTree for Python (2.6) does not provide pointers from child nodes to their parents. Therefore, if parents are needed, the usual suggestion is to loop over parents rather than children.
Suppose my XML is of the form:
<Content>
<Para>first</Para>
<Table><Para>second</Para></Table>
<Para>third</Para>
</Content>
The following finds all "Para" nodes without considering parents:
(1) paras = [p for p in page.getiterator("Para")]
This (adapted from effbot) keeps the parent by looping over every node as a potential parent and then over its children, instead of over the child nodes directly:
(2) paras = [(c,p) for p in page.getiterator() for c in p]
This makes perfect sense, and can be extended with a conditional to achieve the (supposedly) same result as (1), but with parent info added:
(3) paras = [(c,p) for p in page.getiterator() for c in p if c.tag == "Para"]
The ElementTree documentation suggests that the getiterator() method does a depth-first search. Running (1), which does not look for the parent, yields:
first
second
third
However, extracting the text and the parent>child tags from paras in (3) yields:
first, Content>Para
third, Content>Para
second, Table>Para
This appears to be breadth-first.
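A minimal sketch that reproduces both orderings from the sample XML above. It assumes a newer Python, where `iter()` replaces `getiterator()` (on 2.6 the calls would be `getiterator()` as in (1) and (3)); the names `XML`, `order_1` and `order_3` are only for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document from the question.
XML = """<Content>
<Para>first</Para>
<Table><Para>second</Para></Table>
<Para>third</Para>
</Content>"""

page = ET.fromstring(XML)

# (1): iterate over the "Para" nodes directly.
# Assumption: iter() is used in place of getiterator().
order_1 = [p.text for p in page.iter("Para")]

# (3): iterate over every node as a potential parent, then over its children.
order_3 = [(c.text, p.tag + ">" + c.tag)
           for p in page.iter() for c in p if c.tag == "Para"]

print(order_1)
print(order_3)
```

In (3) all "Para" children of Content are emitted before the loop ever reaches Table, which is where the different order comes from.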
This therefore raises two questions.
- Is this correct and expected behaviour?
- How do you extract (parent, child) tuples when the child must be of a certain type but the parent can be anything, and document order must be maintained? I do not think running two loops and mapping the (parent, child) pairs generated by (3) onto the order generated by (1) is ideal.
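For reference, the two-loop approach I would like to avoid can be sketched roughly as follows. This is only an illustration under some assumptions: it uses `iter()` instead of `getiterator()` and a dict comprehension, so it needs Python 2.7 or later, and the names `parent_of` and `pairs` are made up here.

```python
import xml.etree.ElementTree as ET

# Sample document from the question.
XML = """<Content>
<Para>first</Para>
<Table><Para>second</Para></Table>
<Para>third</Para>
</Content>"""

page = ET.fromstring(XML)

# Loop 1: record the parent of every node, as in (2).
parent_of = {c: p for p in page.iter() for c in p}

# Loop 2: visit the "Para" nodes in the order given by (1) and look up
# each parent. The root has no parent, so it is skipped if it matches.
pairs = [(parent_of[c], c) for c in page.iter("Para") if c in parent_of]

for parent, child in pairs:
    print(child.text, parent.tag + ">" + child.tag)
```

This keeps the order of (1) while still giving the parent, at the cost of walking the tree twice and holding a mapping for every node.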