I apologise if this question could be easily answered by searching and reading the lxml documentation but I have tried to no avail.

I've been using lxml's findall quite frequently to query an XML file. Recently, I've needed to use wildcards to extract the data I need, which has led me to XPath.

I've managed to get this working with ETXPath but not with xpath, and I'm confused as to why. An extract of the XML file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<DC xmlns="http://tradefinder.db.com/Schemas/MEL/MelHorizon_0_4_2.xsd">
<Header>
    <FileName>DBL_MPA_Gap_PRD_2017-06-01T07-50-52.xml</FileName>
    <ValidityDate>2017-05-31</ValidityDate>
    <Version>0.42</Version>
    <NoOfRecords>17228</NoOfRecords>
</Header>
<Overviews>
<OverviewLevelTimeStamp>
        <Identifier>Z 1 Index, TRADE</Identifier>
        <Level>2.2120000000000002</Level>
        <Timestamp>09:00:00.000</Timestamp>
</OverviewLevelTimeStamp>
</Overviews>
</DC>

And my Python code used to extract the required nodes:

findshiz = ETXPath("//" + namespace + "DC/" + namespace + "Overviews/" + namespace + "OverviewLevelTimeStamp[" + namespace + "Identifier= 'Z 1 Index, TRADE']")
required_nodes = findshiz(gap_xml)

Here gap_xml is the parsed file.

This code works, but when I try to use xpath instead, it doesn't. The only change I make is replacing ETXPath with xpath. I want to switch because I need wildcards: instead of matching "Z 1 Index, TRADE" exactly, I want to match Z 1 Index*.

Thanks, and let me know of any ways to improve the question.

  • What is namespace? Please show the assignment line: namespace = ... Commented Jun 16, 2017 at 17:09
  • The difference between ETXPath and the "normal" xpath (using XPath internally) is that the former expects namespaces denoted as {http://...}tagname while the latter expects a prefix prefix:tagname and an additional namespace map: {'prefix': 'http://..'}. But otherwise both should do the same. (See also lxml.de/1.3/xpathxslt.html#etxpath) Can you provide your complete code for both versions? Commented Sep 17, 2018 at 12:00
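To make the comment above concrete, here is a minimal sketch of the two namespace notations side by side. It assumes a trimmed-down copy of the XML from the question; the names NS, find_clark, clark_hits and prefix_hits are just illustrative, not taken from the asker's code.

```python
from lxml import etree

NS = "http://tradefinder.db.com/Schemas/MEL/MelHorizon_0_4_2.xsd"

# Trimmed-down copy of the document from the question (assumption: one record).
xml = (
    '<DC xmlns="%s"><Overviews><OverviewLevelTimeStamp>'
    "<Identifier>Z 1 Index, TRADE</Identifier>"
    "</OverviewLevelTimeStamp></Overviews></DC>" % NS
)
root = etree.fromstring(xml)

# ETXPath: namespaces are written inline in Clark notation, {uri}tag.
find_clark = etree.ETXPath(
    "//{%s}OverviewLevelTimeStamp[{%s}Identifier='Z 1 Index, TRADE']" % (NS, NS)
)
clark_hits = find_clark(root)

# xpath(): namespaces are written as prefix:tag plus a prefix-to-URI map.
prefix_hits = root.xpath(
    "//ns:OverviewLevelTimeStamp[ns:Identifier='Z 1 Index, TRADE']",
    namespaces={"ns": NS},
)

print(len(clark_hits), len(prefix_hits))
```

If the {uri}tag form is passed to xpath() unchanged, it is not valid XPath syntax, which would explain why simply swapping ETXPath for xpath fails.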

1 Answer

contains(., "Z 1 Index,") is like searching for *Z 1 Index,*: it is a substring match.

Here is an example that uses contains as a wildcard in xpath and maps the document's namespace to a prefix:

import lxml.etree as etree

xstring = """
<DC xmlns="http://tradefinder.db.com/Schemas/MEL/MelHorizon_0_4_2.xsd">
<Header>
    <FileName>DBL_MPA_Gap_PRD_2017-06-01T07-50-52.xml</FileName>
    <ValidityDate>2017-05-31</ValidityDate>
    <Version>0.42</Version>
    <NoOfRecords>17228</NoOfRecords>
</Header>
<Overviews>
<OverviewLevelTimeStamp>
        <Identifier>Z 1 Index, TRADE</Identifier>
        <Level>2.2120000000000002</Level>
        <Timestamp>09:00:00.000</Timestamp>
</OverviewLevelTimeStamp>
</Overviews>
</DC>"""

# Parse the XML string into an element (the document root).
root = etree.fromstring(xstring)

# Map a prefix of our choosing to the document's default namespace.
nsmap = {'ns': 'http://tradefinder.db.com/Schemas/MEL/MelHorizon_0_4_2.xsd'}

# Select records whose Identifier contains the substring "Z 1 Index,".
print(root.xpath('//ns:OverviewLevelTimeStamp[ns:Identifier[contains(., "Z 1 Index,")]]', namespaces=nsmap))

results in

[<Element {http://tradefinder.db.com/Schemas/MEL/MelHorizon_0_4_2.xsd}OverviewLevelTimeStamp at 0x10647aa70>]

Be aware that lxml xpath returns a list, so you have to extract the matching node from the list.
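As a rough sketch of what "extract the matching node from the list" can look like, assuming a shortened copy of the XML above and treating the names URI, matches, node and level as illustrative only:

```python
from lxml import etree

URI = "http://tradefinder.db.com/Schemas/MEL/MelHorizon_0_4_2.xsd"
nsmap = {"ns": URI}

# Shortened copy of the sample document (assumption: one record).
root = etree.fromstring(
    '<DC xmlns="%s"><Overviews><OverviewLevelTimeStamp>'
    "<Identifier>Z 1 Index, TRADE</Identifier>"
    "<Level>2.2120000000000002</Level>"
    "</OverviewLevelTimeStamp></Overviews></DC>" % URI
)

matches = root.xpath(
    '//ns:OverviewLevelTimeStamp[ns:Identifier[contains(., "Z 1 Index,")]]',
    namespaces=nsmap,
)

# xpath() returns a list for node-set expressions, even for a single hit,
# so guard against the empty case before indexing.
node = matches[0] if matches else None

level = None
if node is not None:
    # findtext() accepts the same prefix map for namespaced children.
    level = node.findtext("ns:Level", namespaces=nsmap)

print(level)
```

Checking for an empty list first avoids an IndexError when nothing matches.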

2 Comments

Hi Sal, thanks for your answer. I can't really use 'contains' though as I need a wildcard for in between string searches. Also, I can't use 'tostring' here given the file is a very large xml.
@naiminp tostring was just for my example; I am not telling you to use it. Also, contains is a wildcard. Prepping an edit.
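For the "in between" case raised in these comments, one possible approach (a sketch, not something confirmed in this thread) is the XPath 1.0 function starts-with for a trailing wildcard, and lxml's EXSLT regular-expression extension (re:test) for a wildcard in the middle. Note that re:test is an lxml/EXSLT extension rather than part of XPath 1.0. The extra "QUOTE" and "Z 2" records and all names below are made up for illustration.

```python
from lxml import etree

URI = "http://tradefinder.db.com/Schemas/MEL/MelHorizon_0_4_2.xsd"
NAMESPACES = {
    "ns": URI,
    # EXSLT regular-expressions namespace, supported by lxml's XPath classes.
    "re": "http://exslt.org/regular-expressions",
}

# Hypothetical sample data: only the first identifier comes from the question.
identifiers = ["Z 1 Index, TRADE", "Z 1 Index, QUOTE", "Z 2 Index, TRADE"]
records = "".join(
    "<OverviewLevelTimeStamp><Identifier>%s</Identifier></OverviewLevelTimeStamp>" % i
    for i in identifiers
)
root = etree.fromstring('<DC xmlns="%s"><Overviews>%s</Overviews></DC>' % (URI, records))


def identifiers_of(nodes):
    """Return the Identifier text of each matched record."""
    return [n.findtext("ns:Identifier", namespaces=NAMESPACES) for n in nodes]


# Trailing wildcard, "Z 1 Index*": plain XPath 1.0 starts-with().
by_prefix = etree.XPath(
    "//ns:OverviewLevelTimeStamp[starts-with(ns:Identifier, 'Z 1 Index')]",
    namespaces=NAMESPACES,
)

# Wildcard in the middle, "Z 1 *TRADE": EXSLT re:test() with a regular expression.
by_regex = etree.XPath(
    "//ns:OverviewLevelTimeStamp[ns:Identifier[re:test(., '^Z 1 .*TRADE$')]]",
    namespaces=NAMESPACES,
)

prefix_ids = identifiers_of(by_prefix(root))
regex_ids = identifiers_of(by_regex(root))
print(prefix_ids)
print(regex_ids)
```

Compiling the expression once with etree.XPath also lets it be reused across many elements, which may help with a very large file.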
