Fetch the HTML tag matching a partial string using XPath

Question

The HTML source is not known in advance, but it contains the string "PRICE". I want to partially match that string against the HTML text using XPath and, when it matches, get the path of the HTML tag that contains it.

Note: I need to automate this logic for multiple sites, so I have to use a generic rule (locate "Price", then fetch its parent tag).

Here is an example:

html="""<div id = "price_id">
  <span id = "id1"></span>
  <div class="price_class">
   <bold>
   <strong>
   <label>PRICE:</label> 125 Rs.
   </bold>
   </strong>
   </br>
   </br>

</div>"""

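I used lxml:

 from lxml.html import fromstring
 from lxml.html.clean import Cleaner

 cleaner = Cleaner(page_structure=False)
 cl = cleaner.clean_html(html)
 cleaned_html = fromstring(cl)

 # iter() walks every descendant, not only the direct children
 for element in cleaned_html.iter():
     # partial match; element.text is None for elements without leading text
     if element.text and 'PRICE' in element.text:
         print("matched")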

How would this be written as an XPath expression?

I just need to get the path of the div (with its class) using an XPath expression.

A further problem: once I locate the "PRICE:" string, I need its nearest valid parent tag, which here is the "div" with class name "price_class". To get there I have to skip or remove unwanted tags such as font, bold, italic...

Could you please suggest how to get the valid parent tag of the located string?

Nash · Accepted Answer · 2019-02-11 11:08:25Z

5

You can use the ancestor axis:

import lxml.html

html = ...
doc = lxml.html.fromstring(html)

for element in doc.xpath('//label[contains(text(), "PRICE:")]/ancestor::div[@class="price_class"]'):
    print('Found %s: %s' % (element.tag, element.text_content().strip()))

output:

Found div: PRICE: 125 Rs.

EDIT: More general solution for modified question:

doc.xpath('//*[contains(text(), "PRICE:")]/\
          ancestor::*[not(self::strong|self::bold|self::italic)][1]')

It will search for an element with the text "PRICE:" and then select the first ancestor skipping strong, bold, italic. You can add more tags to the exclude list.
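As a minimal sketch of the exclude-list approach, the expression can be built from a tuple of tag names so the list is easy to extend. The sample markup (tidied from the question), the `SKIP_TAGS` tuple and the `first_real_ancestor` helper are illustrative assumptions, not part of the original answer:

```python
import lxml.html

# Sample markup modelled on the question, with the nesting tidied up.
SAMPLE = """<div id="price_id">
  <span id="id1"></span>
  <div class="price_class">
    <bold><strong><label>PRICE:</label> 125 Rs.</strong></bold>
  </div>
</div>"""

# Hypothetical exclude list of presentational tags; extend as needed.
SKIP_TAGS = ("strong", "bold", "italic", "b", "i", "font")


def first_real_ancestor(doc, needle, skip_tags=SKIP_TAGS):
    """Return the nearest ancestor(s) of the matching element, skipping skip_tags."""
    skip = "|".join("self::%s" % tag for tag in skip_tags)
    query = '//*[contains(text(), $needle)]/ancestor::*[not(%s)][1]' % skip
    # lxml binds XPath variables such as $needle from keyword arguments.
    return doc.xpath(query, needle=needle)


doc = lxml.html.fromstring(SAMPLE)
matches = first_real_ancestor(doc, "PRICE:")
for element in matches:
    print(element.tag, element.get("class"))
```

Passing the search text as an XPath variable avoids quoting problems when the text itself contains quotes.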

Instead of an exclude list, you can search for the first good ancestor (like div, ul, etc):

doc.xpath('//*[contains(text(), "PRICE:")]/ancestor::*[self::div|self::ul][1]')
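Since the question also asks for the tag path, here is a rough sketch that combines the allow-list expression with `getpath()` from lxml's element tree. The sample markup is an assumption modelled on the question:

```python
import lxml.html

# Sample markup modelled on the question, with the nesting tidied up.
SAMPLE = """<div id="price_id">
  <span id="id1"></span>
  <div class="price_class">
    <bold><strong><label>PRICE:</label> 125 Rs.</strong></bold>
  </div>
</div>"""

doc = lxml.html.fromstring(SAMPLE)
tree = doc.getroottree()

# Nearest ancestor that is a div or ul; other wrappers are ignored.
containers = doc.xpath(
    '//*[contains(text(), "PRICE:")]/ancestor::*[self::div|self::ul][1]'
)

paths = []
for element in containers:
    # getpath() returns an absolute XPath that addresses this element.
    path = tree.getpath(element)
    paths.append(path)
    print(path, element.get("class"))
```

The returned path is positional, so it is only valid for the page it was computed from.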

edited Feb 11, 2019 at 11:08

Nash


answered Jan 3, 2012 at 11:22

reclosedev


4 Comments

Nava Over a year ago

Here I can't see the HTML source code, so I can't specify the attributes and tags manually. For another site the tags and classes will vary, right? I need to automate this logic for many sites, so instead of naming label and price_class I have to use a generic rule.

reclosedev Over a year ago

@saravana, added more general solution to answer.

Nava Over a year ago

Thanks friend :-) I have one question: I need to convert text() to upper case. I tried upper-case(text()) with 'price', but it doesn't work.

reclosedev Over a year ago

@saravana, lxml supports XPath 1.0; upper-case() is in XPath 2.0. As a workaround, you can use something like: translate(text(), "abcdefghijklmnopqrstuvwxyz","ABCDEFGHIJKLMNOPQRSTUVWXYZ")
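A small sketch of the `translate()` workaround from the comment above, assuming ASCII-only text and a sample snippet in which the label is written in mixed case. The `LOWER` and `UPPER` names are illustrative:

```python
import lxml.html

# Assumed sample: the label text is mixed case on this hypothetical site.
SAMPLE = (
    '<div class="price_class">'
    "<strong><label>Price:</label> 125 Rs.</strong>"
    "</div>"
)

LOWER = "abcdefghijklmnopqrstuvwxyz"
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

doc = lxml.html.fromstring(SAMPLE)

# translate() upper-cases the text inside XPath 1.0, so the comparison
# against an upper-case needle is effectively case-insensitive.
query = (
    "//*[contains(translate(text(), $lower, $upper), $needle)]"
    "/ancestor::div[1]"
)
found = doc.xpath(query, lower=LOWER, upper=UPPER, needle="PRICE")

for element in found:
    print(element.tag, element.get("class"))
```

Only the characters listed in `LOWER` are mapped, so accented letters would need to be added to both strings.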

Dimitre Novatchev · Accepted Answer · 2012-01-03 13:56:11Z

0

I just need to get the path of the div (with its class) using an XPath expression.

Use:

//*[contains(text(), 'PRICE')]/ancestor::div[1]/@class
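A short sketch of how this expression behaves in lxml, assuming sample markup modelled on the question. An expression ending in `/@class` yields the attribute values as strings rather than elements:

```python
import lxml.html

# Sample markup modelled on the question, with the nesting tidied up.
SAMPLE = """<div id="price_id">
  <div class="price_class">
    <bold><strong><label>PRICE:</label> 125 Rs.</strong></bold>
  </div>
</div>"""

doc = lxml.html.fromstring(SAMPLE)

# The result is a list of strings, one per matched attribute.
classes = doc.xpath("//*[contains(text(), 'PRICE')]/ancestor::div[1]/@class")
print(classes)
```

If the nearest div has no class attribute, the result is an empty list, so callers should check for that case.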

A further problem: once I locate the "PRICE:" string, I need its nearest valid parent tag, which here is the "div" with class name "price_class". To get there I have to skip or remove unwanted tags such as font, bold, italic...

XPath is a query language for XML documents. As such it cannot modify the structure of an XML document. To do so, another language (that is hosting XPath) has to be used.

XSLT is the most appropriate language for transforming an XML document, as it was designed specifically for that purpose.
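The answer recommends XSLT for the transformation. As an alternative sketch that stays in the Python host language, lxml's `etree.strip_tags()` can remove the unwanted wrappers before the XPath query runs. This is a swapped-in technique, not the XSLT approach described above, and the sample markup and tag list are assumptions:

```python
import lxml.html
from lxml import etree

# Sample markup modelled on the question, with the nesting tidied up.
SAMPLE = (
    '<div id="price_id">'
    '<div class="price_class">'
    "<bold><strong><label>PRICE:</label> 125 Rs.</strong></bold>"
    "</div>"
    "</div>"
)

doc = lxml.html.fromstring(SAMPLE)

# strip_tags() removes the named elements but keeps their text and children,
# merging them into the parent. The tag list here is illustrative.
etree.strip_tags(doc, "bold", "strong", "italic", "font")

# With the wrappers gone, the direct parent is the element we want.
parents = doc.xpath('//*[contains(text(), "PRICE")]/parent::*')
for element in parents:
    print(element.tag, element.get("class"))
```

This modifies the parsed tree in place, so parse a fresh copy if the original structure is still needed.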

answered Jan 3, 2012 at 13:56

Dimitre Novatchev
