The idea is to get the value of tag endTime for the following xml:

<epochs xmlns="http://www.egi.com/epochs_mff" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <epoch>
    <beginTime>0</beginTime>
    <endTime>3586221000</endTime>
    <firstBlock>1</firstBlock>
    <lastBlock>897</lastBlock>
  </epoch>
  <epoch>
    <beginTime>3750143000</beginTime>
    <endTime>5549485000</endTime>
    <firstBlock>898</firstBlock>
    <lastBlock>1347</lastBlock>
  </epoch>
</epochs>

Yet accessing the tag directly returns an empty list:

import xml.etree.ElementTree as ET
tree = ET.parse(r'epochs.xml')
epoch_list=tree.findall("epoch")

However, looping through the tree does return the endTime value.

import xml.etree.ElementTree as ET
tree = ET.parse(r'epochs.xml')

for elem in tree.getroot():
    for subelem in elem:
        print(subelem.text)

How can I retrieve the endTime value (e.g. 300937000) directly?

  • Check your second code block. The third line doesn't seem to be complete. Commented Jul 19, 2020 at 13:28
  • A dirty workaround is to parse the XML file with BeautifulSoup, using the line result = soup_page.find_all("endtime"). Commented Jul 19, 2020 at 14:10

1 Answer

The reason your code failed is that your XML uses a default namespace (xmlns="http://...").

But your call to findall contains epoch without any namespace, so it matches nothing: the parser stores each tag under its qualified name, e.g. {http://www.egi.com/epochs_mff}epoch.

To process namespaced XML, you have to:

  • create a dictionary of used namespaces ({prefix: namespace}),
  • include the prefix of the relevant namespace in the XPath expression,
  • pass the above dictionary as the second argument of findall.

Something like:

ns = {'ep': 'http://www.egi.com/epochs_mff'}
epoch_list = tree.findall('ep:epoch', ns)

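Then, for the sample XML above with its two epoch elements, the result is:

[<Element '{http://www.egi.com/epochs_mff}epoch' at 0x...>, <Element '{http://www.egi.com/epochs_mff}epoch' at 0x...>]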
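For anyone who wants to try this without an epochs.xml file on disk, here is a minimal sketch that parses the sample XML from the question out of a string. The variable names (EPOCHS_XML, plain, prefixed) are only illustrative and not part of the original code.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined so the sketch is self-contained.
EPOCHS_XML = """<epochs xmlns="http://www.egi.com/epochs_mff" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <epoch>
    <beginTime>0</beginTime>
    <endTime>3586221000</endTime>
    <firstBlock>1</firstBlock>
    <lastBlock>897</lastBlock>
  </epoch>
  <epoch>
    <beginTime>3750143000</beginTime>
    <endTime>5549485000</endTime>
    <firstBlock>898</firstBlock>
    <lastBlock>1347</lastBlock>
  </epoch>
</epochs>"""

root = ET.fromstring(EPOCHS_XML)
ns = {'ep': 'http://www.egi.com/epochs_mff'}

# Without a namespace the path matches nothing.
plain = root.findall('epoch')
# With the prefix and the dictionary, both epoch elements are found.
prefixed = root.findall('ep:epoch', ns)

print(len(plain), len(prefixed))  # prints: 0 2
```

The prefix ep is arbitrary; it only has to agree between the dictionary key and the path expression.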

To get the content of your endTime element, if you don't care about any intermediate elements in the XML tree, run:

tree.findtext('.//ep:endTime', namespaces=ns)

Another option is to pass the full path, starting from the children of the root element, but remember to include the namespace prefix at each step:

tree.findtext('ep:epoch/ep:endTime', namespaces=ns)
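As a side note, the prefix dictionary is not the only way to address namespaced elements. ElementTree also accepts the namespace URI written in braces inside the path and, on Python 3.8 or later, a {*} wildcard that matches any namespace. The sketch below uses a shortened inline copy of the sample XML; the names xml_text, uri, first and wild are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document, with a single epoch.
xml_text = (
    '<epochs xmlns="http://www.egi.com/epochs_mff">'
    '<epoch><beginTime>0</beginTime><endTime>3586221000</endTime></epoch>'
    '</epochs>'
)
root = ET.fromstring(xml_text)

# Namespace URI written out in braces: no prefix dictionary is needed.
uri = 'http://www.egi.com/epochs_mff'
first = root.findtext(f'{{{uri}}}epoch/{{{uri}}}endTime')

# Python 3.8+ only: {*} matches the tag in any namespace.
wild = root.findtext('{*}epoch/{*}endTime')

print(first, wild)  # prints: 3586221000 3586221000
```

The braces form is verbose but explicit. The wildcard is shorter, but it would also match same-named elements from other namespaces, so it fits best when the document uses only one.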

If you have multiple endTime elements, one possible solution is to process them in a loop.

Here findtext is not enough, as it returns only the first matching element. Use a loop based on findall and, inside the loop, read the text of the current element and use it as needed, e.g.:

for it in tree.findall('ep:epoch/ep:endTime', namespaces=ns):
    print(it.text)

Of course, replace print with whatever you need to consume the text found.
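If the goal is to keep the values rather than print them, the same findall call can feed a list comprehension. This is only a sketch under the assumption that every endTime holds an integer, as in the sample; xml_text and end_times are illustrative names.

```python
import xml.etree.ElementTree as ET

# Two epochs, as in the sample document (other child elements omitted).
xml_text = (
    '<epochs xmlns="http://www.egi.com/epochs_mff">'
    '<epoch><beginTime>0</beginTime><endTime>3586221000</endTime></epoch>'
    '<epoch><beginTime>3750143000</beginTime><endTime>5549485000</endTime></epoch>'
    '</epochs>'
)
root = ET.fromstring(xml_text)
ns = {'ep': 'http://www.egi.com/epochs_mff'}

# One integer per endTime element, in document order.
end_times = [int(el.text) for el in root.findall('ep:epoch/ep:endTime', ns)]

print(end_times)  # prints: [3586221000, 5549485000]
```

This works for any number of epoch elements, since findall returns every match.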

1 Comment

Thanks for the detailed explanation, @Valdi_Bo. Just to extend the discussion further: how can I loop over tree.findtext('ep:epoch/ep:endTime', namespaces=ns) if there are more than two endTime instances?
