I have a Python script that parses XML files and exports certain elements of interest to a CSV file. I am now trying to change the script so that it filters the XML by a criterion. The equivalent XPath query would be:

/DC/Events/Confirmation[contains(TransactionId,"GTEREVIEW")]

When I try to use lxml to do so, my code is:

xml_file = lxml.etree.parse(xml_file_path)
namespace = "{" + xml_file.getroot().nsmap[None] + "}"
node_list = xml_file.findall(namespace + "Events/" + namespace + "Confirmation[TransactionId='*GTEREVIEW*']")

But this doesn't seem to work. Can anyone help? Example of XML file:

<Events>
  <Confirmation>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>    
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
  </Confirmation>    
</Events> 

So I want all "Confirmation" nodes whose TransactionId contains the string "GTEREVIEW". Thanks

  • Where is your XML file? Commented Nov 15, 2016 at 20:38
  • I've updated the question. Commented Nov 15, 2016 at 23:23

2 Answers

findall() doesn't support XPath expressions, only ElementPath (see https://web.archive.org/web/20200504162744/http://effbot.org/zone/element-xpath.htm). ElementPath doesn't support searching for elements containing a certain string.
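To see why the original findall() call returns nothing, here is a minimal sketch using the standard library's xml.etree.ElementTree, which implements the ElementPath syntax described on the linked page (lxml's findall() is assumed to treat this predicate the same way). SAMPLE is the XML from the question; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# The sample document from the question (no namespace).
SAMPLE = """<Events>
  <Confirmation>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
  </Confirmation>
</Events>"""

root = ET.fromstring(SAMPLE)

# [tag='text'] compares the child's complete text for equality.
exact = root.findall("Confirmation[TransactionId='GTEREVIEW2012']")

# The asterisks are literal characters here, not wildcards,
# so no TransactionId has exactly this text.
wildcard = root.findall("Confirmation[TransactionId='*GTEREVIEW*']")

print(len(exact), len(wildcard))  # 1 0
```

The predicate only helps when the full text is known in advance; substring matching needs either real XPath or a filter in Python.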

Why don't you use XPath? Assuming that the file test.xml contains your sample XML, the following works:

> python
Python 2.7.9 (default, Jun 29 2016, 13:08:31) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> from lxml import etree
>>> tree=etree.parse("test.xml")
>>> tree.xpath("Confirmation[starts-with(TransactionId, 'GTEREVIEW')]")
[<Element Confirmation at 0x7f68b16c3c20>]

If you insist on using findall(), the best you can do is get the list of all Confirmation elements having a TransactionId child node:

>>> tree.findall("Confirmation[TransactionId]")
[<Element Confirmation at 0x7f68b16c3c20>, <Element Confirmation at 0x7f68b16c3ea8>]

You then need to filter this list manually, e.g.:

>>> [e for e in tree.findall("Confirmation[TransactionId]")
     if e.findtext("TransactionId").startswith('GTEREVIEW')]
[<Element Confirmation at 0x7f68b16c3c20>]

If your document contains namespaces, the following will get you all Confirmation elements having a TransactionId child node, provided that the elements use the default namespace (I used xmlns="file:xyz" as the default namespace):

>>> tree.findall("//{{{0}}}Confirmation[{{{0}}}TransactionId]".format(tree.getroot().nsmap[None]))
[<Element {file:xyz}Confirmation at 0x7f534a85d1b8>, <Element {file:xyz}Confirmation at 0x7f534a85d128>]

And there is of course etree.ETXPath:

>>> find=etree.ETXPath("//{{{0}}}Confirmation[starts-with({{{0}}}TransactionId, 'GTEREVIEW')]".format(tree.getroot().nsmap[None]))
>>> find(tree)
[<Element {file:xyz}Confirmation at 0x7f534a85d1b8>]

This allows you to combine XPath and namespaces.
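Another option, sketched here under the assumption that the document uses a default namespace, is to bind that namespace to a prefix and pass the mapping to xpath() through its namespaces argument. The prefix "d" is arbitrary, "file:xyz" is the same placeholder URI used above, and contains() is used to match the substring test from the question.

```python
from lxml import etree

# Sample document with a default namespace ("file:xyz" is a placeholder URI).
SAMPLE = """<Events xmlns="file:xyz">
  <Confirmation>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
  </Confirmation>
</Events>"""

root = etree.fromstring(SAMPLE)

# XPath 1.0 has no notion of a default namespace, so bind it to a prefix.
ns = {"d": root.nsmap[None]}

matches = root.xpath(
    "//d:Confirmation[contains(d:TransactionId, 'GTEREVIEW')]",
    namespaces=ns,
)

# Collect the matching transaction ids for inspection.
ids = [m.findtext("d:TransactionId", namespaces=ns) for m in matches]
print(ids)  # ['GTEREVIEW2012']
```

Compared with ETXPath, this keeps the expression free of the {uri} notation, at the cost of maintaining the prefix mapping.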

2 Comments

Thanks a lot for your answer! Sadly, my document uses a namespace, which makes the XPath query return an empty list. After removing the namespace from the file, the code works. Is there a way around this? The file essentially begins with <DC xmlns="tradefinder.db.com/Schemas/MEL/CapitaHorizon_0_9_2.xsd" xmlns:xs="w3.org/2001/XMLSchema"> and ends with </DC>.
Thought so. You can still use the second approach with findall() then. You just need to do the filtering on the node list returned.
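A minimal sketch of that suggestion for a namespaced document shaped like the one in the comment. The namespace URI "urn:example:dc" is a stand-in, not the real schema URI, and the variable names are illustrative.

```python
from lxml import etree

# Stand-in document: same shape as the asker's file, hypothetical namespace URI.
SAMPLE = """<DC xmlns="urn:example:dc">
  <Events>
    <Confirmation>
      <TransactionId>GTEREVIEW2012</TransactionId>
    </Confirmation>
    <Confirmation>
      <TransactionId>GTEDEF2012</TransactionId>
    </Confirmation>
  </Events>
</DC>"""

root = etree.fromstring(SAMPLE)

# Build the "{uri}" prefix that findall() expects in front of each tag name.
ns = "{" + root.nsmap[None] + "}"

# Step 1: let findall() collect every Confirmation under Events.
candidates = root.findall(ns + "Events/" + ns + "Confirmation")

# Step 2: do the substring test in Python; "or ''" guards against a
# Confirmation that has no TransactionId child or an empty one.
node_list = [
    c for c in candidates
    if "GTEREVIEW" in (c.findtext(ns + "TransactionId") or "")
]

print(len(candidates), len(node_list))  # 2 1
```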

//Confirmation[TransactionId[contains(.,'GTEREVIEW')]]


parent_tag[child_tag]  # select parent_tag elements that have a child_tag child
[child_tag[filter]]    # select only child_tag elements that match the filter
[filter]
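As a quick check, the nested predicate can be run with lxml against the sample from the question. This is only a sketch: it assumes no namespace, and the variable names are illustrative. It also runs the simpler contains(TransactionId, ...) form for comparison.

```python
from lxml import etree

# The sample document from the question (no namespace).
SAMPLE = """<Events>
  <Confirmation>
    <TransactionId>GTEREVIEW2012</TransactionId>
  </Confirmation>
  <Confirmation>
    <TransactionId>GTEDEF2012</TransactionId>
  </Confirmation>
</Events>"""

root = etree.fromstring(SAMPLE)

# Nested form: keep a Confirmation if any TransactionId child passes the filter.
nested = root.xpath("//Confirmation[TransactionId[contains(., 'GTEREVIEW')]]")

# Direct form: contains() converts the node-set to a string first,
# which uses only the first TransactionId child.
direct = root.xpath("//Confirmation[contains(TransactionId, 'GTEREVIEW')]")

nested_ids = [e.findtext("TransactionId") for e in nested]
direct_ids = [e.findtext("TransactionId") for e in direct]
print(nested_ids, direct_ids)
```

Both forms select the same element here. They differ only when a Confirmation has several TransactionId children, because the direct form looks at the first one alone.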
