XML parsing in Python using Python 2 or 3

Question

I'm trying to write a simple program to parse XML like the sample below. So far, following the examples I've found, I'm not getting the results I'm looking for.

I encounter many of these XML files, and I generally want the text inside a handful of tags. What's the best way to use ElementTree to search for <Id> and grab whatever is in that tag? I was trying things like:

for Reel in root.findall('Reel'):
    reel_id = Reel.findtext('Id')
    print(reel_id)

Is there a way to find every instance of <Id> and grab the urn:uuid:... text inside it? I'm after code that traverses the whole document and looks for <what I want>.

This is a very truncated version of what I usually deal with.

This didn't get what I wanted at all. Is there an easy way to match <what I want> in any XML file and get the contents of that tag, or do I need to know the structure of the XML well enough to know its relation to root, child, and so on?

 <Reel>
 <Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
 <AssetList>
  <MainPicture>
   <Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
   <EditRate>24 1</EditRate>
   <IntrinsicDuration>340</IntrinsicDuration>
   <EntryPoint>0</EntryPoint>
   <Duration>340</Duration>
   <FrameRate>24 1</FrameRate>
   <ScreenAspectRatio>2048 858</ScreenAspectRatio>
  </MainPicture>
  <MainSound>
   <Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
   <EditRate>24 1</EditRate>
   <IntrinsicDuration>340</IntrinsicDuration>
   <EntryPoint>0</EntryPoint>
   <Duration>340</Duration>
  </MainSound>
 </AssetList>
</Reel>

@Mata that worked perfectly, but when I tried to use it for different values in another XML file, I fell flat on my face. For instance, what about this section of a file? Unfortunately I couldn't post the whole thing. What if I want to grab the values inside KeyId?

<?xml version="1.0" encoding="UTF-8" standalone="no" ?><DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
  <!-- Generated by Wailua Version 0.3.20 -->
  <AuthenticatedPublic Id="ID_AuthenticatedPublic">
    <MessageId>urn:uuid:7bc63f4c-c617-4d00-9e51-0c8cd6a4f59e</MessageId>
    <MessageType>http://www.digicine.com/PROTO-ASDCP-KDM-20040311#</MessageType>
    <AnnotationText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE ~ KDM for Quvis-10010.pem</AnnotationText>
    <IssueDate>2007-04-29T04:13:43-00:00</IssueDate>
    <Signer>
      <dsig:X509IssuerName>dnQualifier=BzC0n/VV/uVrl2PL3uggPJ9va7Q=,CN=.deluxe-admin-c,OU=.mxf-j2c.ca.cinecert.com,O=.ca.cinecert.com</dsig:X509IssuerName>
      <dsig:X509SerialNumber>10039</dsig:X509SerialNumber>
    </Signer>
    <RequiredExtensions>
      <Recipient>
        <X509IssuerSerial>
          <dsig:X509IssuerName>dnQualifier=RUxyQle0qS7qPbcNRFBEgVjw0Og=,CN=SM.QuVIS.com.001,OU=QuVIS Digital Cinema,O=QuVIS.com</dsig:X509IssuerName>
          <dsig:X509SerialNumber>363</dsig:X509SerialNumber>
        </X509IssuerSerial>
        <X509SubjectName>CN=SM MD LE FM.QuVIS_CinemaPlayer-3d_10010,OU=QuVIS,O=QuVIS.com,dnQualifier=3oBfjTfx1me0p1ms7XOX\+eqUUtE=</X509SubjectName>
      </Recipient>
      <CompositionPlaylistId>urn:uuid:336263da-e4f1-324e-8e0c-ebea00ff79f4</CompositionPlaylistId>
      <ContentTitleText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE</ContentTitleText>
      <ContentKeysNotValidBefore>2007-04-30T05:00:00-00:00</ContentKeysNotValidBefore>
      <ContentKeysNotValidAfter>2007-04-30T10:00:00-00:00</ContentKeysNotValidAfter>
      <KeyIdList>
        <KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
        <KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
        <KeyId>urn:uuid:5d9b228d-7120-344c-aefc-840cdd32bbfc</KeyId>
        <KeyId>urn:uuid:1e32ccb2-ab0b-9d43-b879-1c12840c178b</KeyId>
        <KeyId>urn:uuid:44d04416-676a-2e4f-8995-165de8cab78d</KeyId>
        <KeyId>urn:uuid:906da0c1-b0cb-4541-b8a9-86476583cdc4</KeyId>
        <KeyId>urn:uuid:0fe2d73a-ebe3-9844-b3de-4517c63c4b90</KeyId>
        <KeyId>urn:uuid:862fa79a-18c7-9245-a172-486541bef0c0</KeyId>
        <KeyId>urn:uuid:aa2f1a88-7a55-894d-bc19-42afca589766</KeyId>
        <KeyId>urn:uuid:59d6eeff-cd56-6245-9f13-951554466626</KeyId>
        <KeyId>urn:uuid:14a13b1a-76ba-764c-97d0-9900f58af53e</KeyId>
        <KeyId>urn:uuid:ccdbe0ae-1c3f-224c-b450-947f43bbd640</KeyId>
        <KeyId>urn:uuid:dcd37f10-b042-8e44-bef0-89bda2174842</KeyId>
        <KeyId>urn:uuid:9dd7103e-7e5a-a840-a15f-f7d7fe699203</KeyId>
      </KeyIdList>
    </RequiredExtensions>
    <NonCriticalExtensions/>
  </AuthenticatedPublic>
  <AuthenticatedPrivate Id="ID_AuthenticatedPrivate"><enc:EncryptedKey xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p">
<ds:DigestMethod xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
</enc:EncryptionMethod>

mata · Accepted Answer · 2013-06-13 21:47:42Z

1

The expression Reel.findtext('Id') only matches direct children of Reel. If you want to find all Id tags in your xml document, you can just use:

ids = [id.text for id in Reel.findall(".//Id")]

This gives you a list with the text of every Id element that is a descendant of Reel, at any depth, not just its direct children.
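To make the difference between the two search paths concrete, here is a minimal sketch that parses a shortened copy of the Reel fragment from the question. The variable names (REEL_XML, direct_ids, all_ids) are just for illustration, and the XML is parsed from a string rather than from a file.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the Reel fragment from the question.
REEL_XML = """
<Reel>
 <Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
 <AssetList>
  <MainPicture>
   <Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
  </MainPicture>
  <MainSound>
   <Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
  </MainSound>
 </AssetList>
</Reel>
"""

Reel = ET.fromstring(REEL_XML)  # here the root element is <Reel> itself

# "Id" only looks at the direct children of <Reel>.
direct_ids = [elem.text for elem in Reel.findall("Id")]

# ".//Id" searches every level below <Reel>.
all_ids = [elem.text for elem in Reel.findall(".//Id")]

print(direct_ids)  # one entry: the Reel's own Id
print(all_ids)     # three entries, in document order
```

The same two paths work with findtext and find, which return only the first match instead of a list.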

Edit: Your updated example uses namespaces. KeyId is in the document's default namespace (http://www.digicine.com/PROTO-ASDCP-KDM-20040311#), so you need to include that namespace in the search:

from xml.etree import ElementTree

doc = ElementTree.parse('test.xml')
nsmap = {'ns': 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'}
ids = [id.text for id in doc.findall(".//ns:KeyId", namespaces=nsmap)]
print(ids)
...
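If you can't run this against the real file, the following sketch shows why the namespace matters. It uses a heavily shortened, made-up stand-in for the KDM document (only the root element and two KeyId entries), so the names KDM_XML, clark_ids and mapped_ids are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

KDM_NS = "http://www.digicine.com/PROTO-ASDCP-KDM-20040311#"

# Shortened stand-in for the KDM file in the question (not the real file).
KDM_XML = """<DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#">
  <KeyIdList>
    <KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
    <KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
  </KeyIdList>
</DCinemaSecurityMessage>"""

root = ET.fromstring(KDM_XML)

# ElementTree stores the namespace as part of the tag: '{uri}KeyId'.
first_key = root.find(".//{%s}KeyId" % KDM_NS)
print(first_key.tag)

# A bare tag name therefore matches nothing in this document.
unqualified = root.findall(".//KeyId")
print(unqualified)  # []

# Two equivalent ways to qualify the name: '{uri}tag' or a prefix map.
clark_ids = [e.text for e in root.findall(".//{%s}KeyId" % KDM_NS)]
mapped_ids = [e.text for e in root.findall(".//ns:KeyId", {"ns": KDM_NS})]
print(clark_ids == mapped_ids)  # True
```

The prefix in the map ("ns" here) is arbitrary; only the URI has to match the one declared in the document. On Python 3.8 and later, ElementTree also accepts a wildcard such as ".//{*}KeyId" to match the tag in any namespace.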

The XPath subset that ElementTree supports is rather limited. If you need fuller support, use lxml instead; its XPath implementation is far more complete.

For example, using xpath to search for all KeyId tags (ignoring namespaces) and returning their text content directly:

from lxml import etree
doc = etree.parse('test.xml')
ids = doc.xpath(".//*[local-name()='KeyId']/text()")
print(ids)
...

edited Jun 13, 2013 at 21:47

answered Jun 13, 2013 at 19:51

mata


3 Comments

inbinder Over a year ago

but say I have a bunch of Id fields in the XML doc and I want the text/data that comes after Id. Sorry, I feel like I don't know enough about XML to use the correct vocab.

mata Over a year ago

@user1124541 - I've added more examples, hope it helps. By the way, your second XML fragment isn't well-formed; it is missing some tags...

inbinder Over a year ago

@mata - thanks. I'm realizing I just don't really understand the structure of these types of XML files. I wish I could post the whole file so I could ask more questions.

girasquid · Accepted Answer · 2013-06-13 19:43:00Z

1

It sounds like XPath might be right up your alley - it will let you query your XML document for exactly what you're looking for, as long as you know the structure.

answered Jun 13, 2013 at 19:43

girasquid


inbinder · Accepted Answer · 2013-06-14 14:37:17Z

0

Here's what I needed to do. This works for finding whatever I need.

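# tree.iter() replaces getiterator(), which was removed in Python 3.9.
for node in tree.iter():
    # Namespaced tags look like '{namespace-uri}KeyId', so compare only the
    # part after '}'. A plain substring test would also match KeyIdList.
    if node.tag.rsplit('}', 1)[-1] == 'KeyId':
        key_id = node.text  # the element's content, not its tag name
        print(key_id)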
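The loop above can be wrapped in a small helper so the same search works for any tag name, with or without namespaces. This is only a sketch using the standard library: local_name and find_texts are hypothetical helper names, and the sample document and its namespace URI are made up for the example. It compares the local name exactly, so KeyIdList is not mistaken for KeyId.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return a tag without its '{namespace-uri}' prefix, if it has one."""
    return tag.rsplit("}", 1)[-1]


def find_texts(root, name):
    """Collect the text of every element whose local name equals `name`."""
    return [el.text for el in root.iter() if local_name(el.tag) == name]


# Made-up sample document; the namespace URI is a placeholder.
SAMPLE = """<Message xmlns="http://example.com/hypothetical-ns">
  <Id>urn:uuid:aaaa</Id>
  <KeyIdList>
    <KeyId>urn:uuid:bbbb</KeyId>
    <KeyId>urn:uuid:cccc</KeyId>
  </KeyIdList>
</Message>"""

root = ET.fromstring(SAMPLE)
key_ids = find_texts(root, "KeyId")
ids = find_texts(root, "Id")
print(key_ids)
print(ids)
```

This trades precision for convenience: it ignores which namespace a tag belongs to, so two different vocabularies that both define an Id element would be mixed together. When that matters, qualify the name with the namespace instead.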

answered Jun 14, 2013 at 14:37

inbinder
