Using XPath in Python with LXML

Question

I have a Python script that parses XML files and exports certain elements of interest to a CSV file. I am now trying to change the script so that it filters the XML file by a criterion. The equivalent XPath query would be:

/DC/Events/Confirmation[contains(TransactionId,"GTEREVIEW")]

When I try to use lxml to do so, my code is:

xml_file = lxml.etree.parse(xml_file_path)
namespace = "{" + xml_file.getroot().nsmap[None] + "}"
node_list = xml_file.findall(namespace + "Events/" + namespace + "Confirmation[TransactionId='*GTEREVIEW*']")

But this doesn't seem to work. Can anyone help? Example of XML file:

<Events>
  <Confirmation>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>    
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
  </Confirmation>    
</Events>

So I want all "Confirmation" nodes whose TransactionId contains the string "GTEREVIEW". Thanks.

Where is your XML file?

– SomeDude, Commented Nov 15, 2016 at 20:38
I've updated the question.

– naiminp, Commented Nov 15, 2016 at 23:23

Markus · Accepted Answer · 2020-11-29 11:55:21Z

10

findall() doesn't support full XPath expressions, only the ElementPath subset (see https://web.archive.org/web/20200504162744/http://effbot.org/zone/element-xpath.htm). ElementPath has no way to select elements whose text contains a given substring.

Why don't you use XPath? Assuming that the file test.xml contains your sample XML, the following works:

> python
Python 2.7.9 (default, Jun 29 2016, 13:08:31) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> from lxml import etree
>>> tree=etree.parse("test.xml")
>>> tree.xpath("Confirmation[starts-with(TransactionId, 'GTEREVIEW')]")
[<Element Confirmation at 0x7f68b16c3c20>]
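The question asks for a substring match rather than a prefix match, so contains() may be the closer fit. The following is a minimal sketch, assuming the sample XML from the question plus one hypothetical third record (2012-GTEREVIEW) added only to show where the two functions differ. The helper name transaction_ids is made up for illustration.

```python
from lxml import etree

# The first two records come from the question; the third is hypothetical
# and exists only to show a match that is not at the start of the text.
SAMPLE = b"""<Events>
  <Confirmation><TransactionId>GTEREVIEW2012</TransactionId></Confirmation>
  <Confirmation><TransactionId>GTEDEF2012</TransactionId></Confirmation>
  <Confirmation><TransactionId>2012-GTEREVIEW</TransactionId></Confirmation>
</Events>"""

root = etree.fromstring(SAMPLE)


def transaction_ids(expression):
    """Evaluate an XPath expression on the root and return the matched ids."""
    return [c.findtext("TransactionId") for c in root.xpath(expression)]


# starts-with() only matches when the id begins with the search string.
prefix_ids = transaction_ids("Confirmation[starts-with(TransactionId, 'GTEREVIEW')]")
# contains() matches the search string anywhere in the id.
substring_ids = transaction_ids("Confirmation[contains(TransactionId, 'GTEREVIEW')]")

print(prefix_ids)     # ['GTEREVIEW2012']
print(substring_ids)  # ['GTEREVIEW2012', '2012-GTEREVIEW']
```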

If you insist on using findall(), the best you can do is get the list of all Confirmation elements having a TransactionId child node:

>>> tree.findall("Confirmation[TransactionId]")
[<Element Confirmation at 0x7f68b16c3c20>, <Element Confirmation at 0x7f68b16c3ea8>]

You then need to filter this list manually, e.g.:

>>> [e for e in tree.findall("Confirmation[TransactionId]")
     if e[0].text.startswith('GTEREVIEW')]
[<Element Confirmation at 0x7f68b16c3c20>]
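The e[0] lookup above assumes TransactionId is always the first child and always has text. A slightly more defensive sketch of the same manual filter could use findtext() and a substring test. The function name confirmations_containing is hypothetical, and the XML is the sample from the question.

```python
from lxml import etree

SAMPLE = b"""<Events>
  <Confirmation>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
  </Confirmation>
</Events>"""


def confirmations_containing(parent, needle):
    """Return Confirmation children whose TransactionId text contains needle."""
    return [
        conf
        for conf in parent.findall("Confirmation[TransactionId]")
        # "or ''" guards against a TransactionId element with no text.
        if needle in (conf.findtext("TransactionId") or "")
    ]


root = etree.fromstring(SAMPLE)
matches = confirmations_containing(root, "GTEREVIEW")
print([m.findtext("TransactionId") for m in matches])  # ['GTEREVIEW2012']
```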

If your document contains namespaces, the following will get you all Confirmation elements having a TransactionId child node, provided that the elements use the default namespace (I used xmlns="file:xyz" as the default namespace):

>>> tree.findall("//{{{0}}}Confirmation[{{{0}}}TransactionId]".format(tree.getroot().nsmap[None]))
[<Element {file:xyz}Confirmation at 0x7f534a85d1b8>, <Element {file:xyz}Confirmation at 0x7f534a85d128>]

And there is of course etree.ETXPath:

>>> find=etree.ETXPath("//{{{0}}}Confirmation[starts-with({{{0}}}TransactionId, 'GTEREVIEW')]".format(tree.getroot().nsmap[None]))
>>> find(tree)
[<Element {file:xyz}Confirmation at 0x7f534a85d1b8>]

This allows you to combine XPath and namespaces.
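Another option that seems to work is xpath() with a namespaces mapping. XPath 1.0 has no notion of a default namespace, so the default namespace has to be bound to a prefix of your choosing. This is a sketch under assumptions: the prefix d is arbitrary, and file:xyz is the same placeholder namespace used above.

```python
from lxml import etree

# Same sample as in the question, with a placeholder default namespace.
NS_SAMPLE = b"""<Events xmlns="file:xyz">
  <Confirmation>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
  </Confirmation>
</Events>"""

root = etree.fromstring(NS_SAMPLE)

# xpath() rejects a None prefix, so bind the default namespace to "d".
ns = {"d": root.nsmap[None]}

matches = root.xpath(
    "//d:Confirmation[contains(d:TransactionId, 'GTEREVIEW')]",
    namespaces=ns,
)
ids = [m.findtext("d:TransactionId", namespaces=ns) for m in matches]
print(ids)  # ['GTEREVIEW2012']
```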

edited Nov 29, 2020 at 11:55

answered Nov 16, 2016 at 8:09


2 Comments

naiminp Over a year ago

Thanks a lot for your answer! Sadly, there is a namespace involved in my doc, which results in the XPath returning an empty list. After removing the namespace from the file, the code appears to work. Is there a way around this? The file essentially begins with <DC xmlns="tradefinder.db.com/Schemas/MEL/CapitaHorizon_0_9_2.xsd" xmlns:xs="w3.org/2001/XMLSchema"> and ends with </DC>

Markus Over a year ago

Thought so. You can still use the second approach with findall() then. You just need to do the filtering on the node list returned.
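For a namespaced document shaped like the one described in the comment, the findall() approach plus manual filtering might look like the sketch below. The DC/Events/Confirmation layout is taken from the question, but the document itself is a made-up stand-in, and file:xyz is a placeholder for the real namespace URI.

```python
from lxml import etree

# Hypothetical stand-in for the real file: a DC root with a default namespace.
DOC = b"""<DC xmlns="file:xyz">
  <Events>
    <Confirmation>
      <TransactionId>GTEREVIEW2012</TransactionId>
    </Confirmation>
    <Confirmation>
      <TransactionId>GTEDEF2012</TransactionId>
    </Confirmation>
  </Events>
</DC>"""

root = etree.fromstring(DOC)

# Build the "{uri}" prefix that ElementPath expects in front of each tag.
ns = "{%s}" % root.nsmap[None]

# Step 1: findall() selects every Confirmation that has a TransactionId child.
path = "{0}Events/{0}Confirmation[{0}TransactionId]".format(ns)
candidates = root.findall(path)

# Step 2: filter the node list in Python, since ElementPath cannot do it.
filtered = [
    conf for conf in candidates
    if "GTEREVIEW" in (conf.findtext(ns + "TransactionId") or "")
]
ids = [conf.findtext(ns + "TransactionId") for conf in filtered]
print(ids)  # ['GTEREVIEW2012']
```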

宏杰李 · Answer · 2016-11-16 13:47:09Z

0

//Confirmation[TransactionId[contains(.,'GTEREVIEW')]]


father_tag[child_tag]          # select father_tag elements that have a child_tag child
father_tag[child_tag[filter]]  # select father_tag elements whose child_tag matches the filter
[filter]                       # a predicate, here contains(., 'GTEREVIEW')
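The nested predicate is not quite the same as contains(TransactionId, ...). In XPath 1.0, passing a node-set to contains() uses only the string value of its first node, while the nested form tests every TransactionId child. The sketch below illustrates this, assuming a hypothetical second record that carries two TransactionId children (not part of the question's sample).

```python
from lxml import etree

# The second Confirmation is hypothetical: its match is in the second id.
SAMPLE = b"""<Events>
  <Confirmation>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
    <TransactionId>GTEREVIEW2013</TransactionId>
  </Confirmation>
</Events>"""

root = etree.fromstring(SAMPLE)

# Nested predicate: true if ANY TransactionId child contains the string.
nested = root.xpath("//Confirmation[TransactionId[contains(., 'GTEREVIEW')]]")

# Flat form: contains() only looks at the FIRST TransactionId child.
flat = root.xpath("//Confirmation[contains(TransactionId, 'GTEREVIEW')]")

print(len(nested))  # 2
print(len(flat))    # 1
```

When every Confirmation has exactly one TransactionId, as in the question's sample, both expressions should select the same elements.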

edited Nov 16, 2016 at 13:47

answered Nov 16, 2016 at 8:33
