Python parse xml doesn't find element [duplicate]

Question

I'm trying to find a tag using xml.etree.ElementTree. I don't know its exact position, so I have to search for it.

The inputs are NuGet specifications for .NET NuGet packages.

I used this code to find the element but it doesn't find it:

import xml.etree.ElementTree as ET

content = ......

tree = ET.fromstring(content)

# none of the following lines work
tag = tree.find('licenseUrl')
tags = tree.findall('*/licenseUrl')
tags = tree.findall('.//licenseUrl')
tags = tree.findall('licenseUrl')

But len(tags) is always 0.

If I use a regex to find it, it works like a charm:

re.search(r'<licenseUrl>(?P<url>.*?)</licenseUrl>', content, flags=re.DOTALL | re.MULTILINE)

But using regex to parse XML is not recommended.

What am I doing wrong?

DEMO that shows the working code.

I was using the following information without luck:

For completeness, here is the value of content:

<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata>
    <id>AutoMapper</id>
    <version>9.0.0</version>
    <authors>Jimmy Bogard</authors>
    <owners>Jimmy Bogard</owners>
    <requireLicenseAcceptance>false</requireLicenseAcceptance>
    <licenseUrl>https://github.com/AutoMapper/AutoMapper/blob/master/LICENSE.txt</licenseUrl>
    <projectUrl>https://automapper.org/</projectUrl>
    <iconUrl>https://s3.amazonaws.com/automapper/icon.png</iconUrl>
    <description>A convention-based object-object mapper.</description>
    <repository type="git" url="https://github.com/AutoMapper/AutoMapper" commit="53faf3f014802b502f6a49b4c94368f478752f59" />
    <dependencies>
      <group targetFramework=".NETFramework4.6.1" />
      <group targetFramework=".NETStandard2.0">
        <dependency id="Microsoft.CSharp" version="4.5.0" exclude="Build,Analyzers" />
        <dependency id="System.Reflection.Emit" version="4.3.0" exclude="Build,Analyzers" />
      </group>
    </dependencies>
    <frameworkAssemblies>
      <frameworkAssembly assemblyName="Microsoft.CSharp" targetFramework=".NETFramework4.6.1" />
    </frameworkAssemblies>
  </metadata>
</package>

You are not taking the namespace into account. See docs.python.org/3/library/…
– mzjn, Commented Apr 23, 2020 at 11:49
@mzjn Thx. Maybe that's the problem, but how do I ignore it? I don't care about the namespace, and I'm pretty sure it differs between XML files of different versions.
– Sebastian Schumann, Commented Apr 23, 2020 at 11:52
There have been many questions about processing XML with namespaces. See for example stackoverflow.com/a/61154644/407651
– mzjn, Commented Apr 23, 2020 at 11:55

Shmygol · Accepted Answer · 2020-04-23 12:13:52Z

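Your XML has a default namespace, which you are not taking into account. Because of the xmlns declaration on <package>, every element in the document belongs to that namespace, so an unqualified name such as licenseUrl matches nothing. This code should work: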

import xml.etree.ElementTree as ET

content = ......

tree = ET.fromstring(content)
ns = {'ms': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}
tags = tree.findall('.//ms:licenseUrl', ns)
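
To see why the unqualified search comes back empty, here is a minimal sketch using a trimmed copy of the XML from the question (the XML declaration and most elements are left out for brevity). The names SAMPLE, unqualified, and qualified are made up for this illustration; the prefix 'ms' is arbitrary, and only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the nuspec from the question (XML declaration omitted).
SAMPLE = """\
<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata>
    <id>AutoMapper</id>
    <licenseUrl>https://github.com/AutoMapper/AutoMapper/blob/master/LICENSE.txt</licenseUrl>
  </metadata>
</package>
"""

root = ET.fromstring(SAMPLE)

# ElementTree stores the namespace inside the tag name as "{uri}local".
print(root.tag)

# An unqualified name therefore matches nothing.
unqualified = root.findall('.//licenseUrl')

# 'ms' is an arbitrary prefix chosen here; only the URI has to match.
ns = {'ms': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}
qualified = root.findall('.//ms:licenseUrl', ns)

print(len(unqualified), len(qualified))
print(qualified[0].text)
```

Printing root.tag is a quick way to check whether a document uses a default namespace before writing any search paths.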

UPDATE: Or, as @mzjn mentioned in the comments, just use the {*} wildcard (supported since Python 3.8) if you really don't care about the namespaces:

import xml.etree.ElementTree as ET

content = ......

tree = ET.fromstring(content)
tags = tree.findall('.//{*}licenseUrl')
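
If the namespace URI differs between nuspec versions and the {*} wildcard is not available, one possible workaround is to read the URI from the root element's tag and build the search path from it. This is only a sketch under assumptions: find_local is a hypothetical helper, not part of ElementTree, the second namespace URI below is invented for the demonstration, and the helper only matches elements that share the root element's namespace.

```python
import xml.etree.ElementTree as ET


def find_local(root, local_name):
    """Return all descendants named local_name in the root element's namespace."""
    # root.tag looks like "{uri}package" when the document has a namespace.
    if root.tag.startswith('{'):
        uri = root.tag[1:].split('}', 1)[0]
        return root.findall('.//{%s}%s' % (uri, local_name))
    # No namespace: a plain name works.
    return root.findall('.//' + local_name)


# Two documents that differ only in namespace URI; the second URI is made up.
doc_a = ('<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">'
         '<metadata><licenseUrl>a</licenseUrl></metadata></package>')
doc_b = ('<package xmlns="http://example.com/hypothetical-nuspec">'
         '<metadata><licenseUrl>b</licenseUrl></metadata></package>')

for doc in (doc_a, doc_b):
    tags = find_local(ET.fromstring(doc), 'licenseUrl')
    print([t.text for t in tags])
```

The same helper also handles documents without any namespace, since it falls back to a plain name in that case.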

edited Apr 23, 2020 at 12:13

answered Apr 23, 2020 at 12:01
