
I'm trying to find a tag using xml.etree.ElementTree. I don't know its exact position in the document, so I have to search for it.

The input is a NuGet specification (.nuspec) file for a .NET NuGet package.

I used this code to find the element but it doesn't find it:

import xml.etree.ElementTree as ET

content = ......

tree = ET.fromstring(content)

# none of the following lines are working
tag = tree.find('licenseUrl')
tags = tree.findall('*/licenseUrl')
tags = tree.findall('.//licenseUrl')
tags = tree.findall('licenseUrl')

But len(tags) is always 0.

If I use a regex to find it, it works like a charm:

re.search(r'<licenseUrl>(?P<url>.*?)</licenseUrl>', content, flags=re.DOTALL | re.MULTILINE)

But using regex to parse XML is not recommended.

What am I doing wrong?

DEMO that shows the working code.

I was using the following information without luck:

For completeness the content of content:

<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata>
    <id>AutoMapper</id>
    <version>9.0.0</version>
    <authors>Jimmy Bogard</authors>
    <owners>Jimmy Bogard</owners>
    <requireLicenseAcceptance>false</requireLicenseAcceptance>
    <licenseUrl>https://github.com/AutoMapper/AutoMapper/blob/master/LICENSE.txt</licenseUrl>
    <projectUrl>https://automapper.org/</projectUrl>
    <iconUrl>https://s3.amazonaws.com/automapper/icon.png</iconUrl>
    <description>A convention-based object-object mapper.</description>
    <repository type="git" url="https://github.com/AutoMapper/AutoMapper" commit="53faf3f014802b502f6a49b4c94368f478752f59" />
    <dependencies>
      <group targetFramework=".NETFramework4.6.1" />
      <group targetFramework=".NETStandard2.0">
        <dependency id="Microsoft.CSharp" version="4.5.0" exclude="Build,Analyzers" />
        <dependency id="System.Reflection.Emit" version="4.3.0" exclude="Build,Analyzers" />
      </group>
    </dependencies>
    <frameworkAssemblies>
      <frameworkAssembly assemblyName="Microsoft.CSharp" targetFramework=".NETFramework4.6.1" />
    </frameworkAssemblies>
  </metadata>
</package>
  • You are not taking the namespace into account. See docs.python.org/3/library/… Commented Apr 23, 2020 at 11:49
  • @mzjn Thx. Maybe that's the problem but how do I ignore them? I don't care about the namespace and I'm pretty sure that they differ between xmls of different versions. Commented Apr 23, 2020 at 11:52
  • There have been many questions about processing XML with namespaces. See for example stackoverflow.com/a/61154644/407651 Commented Apr 23, 2020 at 11:55
  • @mzjn Okay. Support was added in 3.8. I've to test it. Commented Apr 23, 2020 at 12:01
  • @mzjn Thx. The support for .//{*}xxx works. Commented Apr 23, 2020 at 12:10

1 Answer

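Your XML declares a default namespace, which you are not taking into account. ElementTree stores every tag together with its namespace URI, so a search for the bare name licenseUrl matches nothing. This code should work: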

import xml.etree.ElementTree as ET

content = ......

tree = ET.fromstring(content)
ns = {'ms': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}
tags = tree.findall('.//ms:licenseUrl', ns)
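
To see why the un-namespaced searches come back empty, it can help to print the tag that ElementTree actually stores. The following is only a minimal sketch: `sample` is a shortened copy of the nuspec from the question, and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the nuspec from the question (illustrative only).
sample = """<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata>
    <id>AutoMapper</id>
    <licenseUrl>https://github.com/AutoMapper/AutoMapper/blob/master/LICENSE.txt</licenseUrl>
  </metadata>
</package>"""

root = ET.fromstring(sample)

# The default namespace becomes part of every tag name, in "{uri}local" form.
print(root.tag)

# The prefix 'ms' is arbitrary; it only has to match the key in the mapping.
ns = {'ms': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}

unqualified = root.findall('.//licenseUrl')      # bare name: no match
qualified = root.findall('.//ms:licenseUrl', ns)  # namespace-qualified: match

print(len(unqualified), len(qualified))
print(qualified[0].text)
```

The printed root tag includes the full namespace URI in braces, which is the name a path expression has to match.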

UPDATE: Or, as @mzjn mentioned in the comments, use the {*} namespace wildcard (supported since Python 3.8) if you really don't care about the namespaces:

import xml.etree.ElementTree as ET

content = ......

tree = ET.fromstring(content)
tags = tree.findall('.//{*}licenseUrl')
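
If the code also has to run on Python versions older than 3.8, or the namespace differs between nuspec versions, one possible fallback is to compare local tag names by hand. This is only a sketch under those assumptions: `local_name` and `find_by_local_name` are hypothetical helpers, not part of ElementTree, and the example.com URLs are placeholders.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag without its leading '{namespace}' part, if any."""
    return tag.rsplit('}', 1)[-1]


def find_by_local_name(root, name):
    """Collect root and all descendants whose local tag name equals `name`."""
    return [
        el for el in root.iter()
        # Skip nodes such as comments, whose .tag is not a string.
        if isinstance(el.tag, str) and local_name(el.tag) == name
    ]


# One document with the nuspec namespace, one without any namespace.
namespaced = ET.fromstring(
    '<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">'
    '<metadata><licenseUrl>https://example.com/LICENSE.txt</licenseUrl></metadata>'
    '</package>'
)
plain = ET.fromstring(
    '<package><metadata>'
    '<licenseUrl>https://example.com/LICENSE.txt</licenseUrl>'
    '</metadata></package>'
)

urls = [find_by_local_name(doc, 'licenseUrl')[0].text for doc in (namespaced, plain)]
print(urls)
```

The helper walks the whole tree with iter(), so it is simpler than a path expression but does not support nested paths such as metadata/licenseUrl.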