4

The html code is blind and It contains the string "PRICE" in html. That partial string has to be matched with html text.If the text matches(partial match) using xpath.It should return the particular html tag path.

Note: I need to automate this logic for multiple sites.I should have to use the generic rule (For locating "Price",Fetching Parent tag)

This is example:

html="""<div id = "price_id">
  <span id = "id1"></span>
  <div class="price_class">
   <bold>
   <strong>
   <label>PRICE:</label> 125 Rs.
   </bold>
   </strong>
   </br>
   </br>

</div>"""

I used lxml

 from lxml.html.clean import Cleaner     

 cleaner =Cleaner(page_structure=False)
 cl = cleaner.clean_html(html)
 cleaned_html = fromstring(cl)

 for element in cleaned_html:
      if element.text == 'PRICE':
          print "matched"

How it would be written using Xpath expression?

I just need to get the div class path using xpath expression.

Also The problem is if I locate the "PRICE:" string. I should have to get the parent valid tag that is "div" with class name "price_class". but here i should have to skip or remove the unwanted tags like font,bold,italic...

Could you please suggest me to get the parent valid tag of the located string?

2 Answers 2

5

You can use the ancestor axis:

import lxml.html

html = ...
doc = lxml.html.fromstring(html)

for element in doc.xpath('//label[contains(text(), "PRICE:")]/ancestor::div[@class="price_class"]'):
    print 'Found %s: %s' % (element.tag, element.text_content().strip())

output:

Found div: PRICE: 125 Rs.

EDIT: More general solution for modified question:

doc.xpath('//*[contains(text(), "PRICE:")]/\
          ancestor::*[not(self::strong|self::bold|self::italic)][1]')

It will search for an element with the text "PRICE:" and then select the first ancestor skipping strong, bold, italic. You can add more tags to the exclude list.

Instead of an exclude list, you can search for the first good ancestor (like div, ul, etc):

doc.xpath('//*[contains(text(), "PRICE:")]/ancestor::*[self::div|self::ul][1]')
Sign up to request clarification or add additional context in comments.

4 Comments

Here i can't see the html source code.So,I can't use the attributes and tags manually.For another site the tags and class will be varied right? I need to automate this logic for many sites.So here instead of mentioning(label,price_class) I could have to use the gereric rule
@saravana, added more general solution to answer.
Thanks friend :-) I have one doubt I need to convert the case to upper for text(). I tried upper-case(text()),'price'.but it is not doing
@saravana, lxml supports XPath 1.0, upper-case() is in XPath 2.0. As workaround, you can use something like: translate(text(), "abcdefghijklmnopqrstuvwxyz","ABCDEFGHIJKLMNOPQRSTUVWXYZ")
0

I just need to get the div class path using xpath expression.

Use:

//*[contains(text(), 'PRICE')]/ancestor::div[1]/@class

Also The problem is if I locate the "PRICE:" string. I should have to get the parent valid tag that is "div" with class name "price_class". but here i should have to skip or remove the unwanted tags like font,bold,italic...

XPath is a query language for XML documents. As such it cannot modify the structure of an XML document. To do so, another language (that is hosting XPath) has to be used.

XSLT is the most appropriate language for performing a transformation of an XML document, as it was especially designed with that purpose.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.