
I was trying to scrape a website using Python's requests and lxml. I could easily select elements with a single class using html.xpath(), but I can't figure out how to select elements with multiple classes.

I used code like this to select the elements in the page with class "title":

page.xpath('//a[@class="title"]')

However, I couldn't select elements with multiple classes. I looked at a few code examples and tried to study XPath, but it seems like lxml.html.xpath() works differently; maybe it's my lack of understanding. I tried a few queries that didn't work for me. They are given below.

HTML code:

<a href="https://www.lovemycosmetic.de/skin1004-madagascar-centella-ampoule-30ml-" class="info text-center" title="SKIN1004 Madagascar Centella Ampoule 30ml"> <strong class="supplier"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004</font></font></strong><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004 Madagascar Centella Ampoule 30ml</font></font></a>

Test 1:

page.xpath('//a[@class="info text-center"]')

Test 2:

page.xpath("//a[@class='info text-center']")

Test 3:

page.xpath('//a[@class="info.text-center"]')

Test 4:

page.xpath("//a[contains(@class, 'info') and contains(@class, 'text-center')]")

I ran a couple more tests too, but I forgot to save the code. It would be great to know how to select elements with multiple classes using lxml.html.xpath().

  • Post the HTML snippet you are trying to parse... Commented Dec 17, 2022 at 19:38
  • @Alexander I have edited my question. Would you mind checking it? Commented Dec 17, 2022 at 19:45
  • Not the Python code... the HTML. Either post the portion that contains the element you are trying to extract or a link to the website that contains it. The reason I want to see the HTML is that your Test 1 and Test 2 both look accurate, but without seeing the HTML it's impossible to say why they aren't working. Commented Dec 17, 2022 at 19:46
  • Your Test 1 should work fine... It does for me. Commented Dec 17, 2022 at 19:59
  • Test 2 works for me: a = page.xpath('//a[@class="info text-center"]'); print(a[0].text) Commented Dec 17, 2022 at 20:00

2 Answers

Note that as far as XPath is concerned, the class attribute's value is a string like any other. XPath doesn't automatically parse the value as a list of space-delimited tokens, as a CSS selector would. XPath 3.1 provides the function contains-token(), but lxml supports only XPath 1.0, in which you basically have to tokenize the class value yourself.

If your class values are literally info text-center, then you can test for them with the predicate [@class="info text-center"], but that won't match a class value of e.g. text-center info or info text-center foo bar. I'd recommend the XPath contains() function instead; bear in mind that it is a substring test, so contains(@class, "info") would also match a class such as infobox. For example:

//a[contains(@class, "info")][contains(@class, "text-center")]
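
A minimal sketch of the difference between these predicates, assuming a hypothetical three-link snippet (the markup, the hrefs, and the helper has_class are illustrative and not taken from the page in question). It also shows a common XPath 1.0 workaround for whole-token matching, which pads the whitespace-normalized class value with spaces before calling contains():

```python
from lxml import html

# Hypothetical sample markup: the same two classes in different orders,
# plus a near-miss ("infobox") that only shares a substring with "info".
SAMPLE = """
<div>
  <a class="info text-center" href="/a">A</a>
  <a class="text-center info extra" href="/b">B</a>
  <a class="infobox text-center" href="/c">C</a>
</div>
"""

page = html.fromstring(SAMPLE)


def has_class(name):
    """Build an XPath 1.0 predicate body that matches `name` as a whole token."""
    return 'contains(concat(" ", normalize-space(@class), " "), " ' + name + ' ")'


# Exact string comparison: only the literal value "info text-center" matches.
exact = page.xpath('//a[@class="info text-center"]')

# Substring test: order-independent, but "infobox" also contains "info".
loose = page.xpath('//a[contains(@class, "info")][contains(@class, "text-center")]')

# Token test: order-independent and limited to whole class names.
token = page.xpath('//a[' + has_class("info") + '][' + has_class("text-center") + ']')

exact_hrefs = [a.get("href") for a in exact]
loose_hrefs = [a.get("href") for a in loose]
token_hrefs = [a.get("href") for a in token]

print(exact_hrefs)  # ['/a']
print(loose_hrefs)  # ['/a', '/b', '/c']
print(token_hrefs)  # ['/a', '/b']
```

The padded form is more verbose, but it avoids false positives when one class name is a prefix or substring of another.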

Your Test 1 and Test 2 should both work fine. This is the code I used to get the results:

from lxml import etree
# Parse the snippet from the question; the <a> element becomes the root.
root = etree.fromstring('<a href="https://www.lovemycosmetic.de/skin1004-madagascar-centella-ampoule-30ml-" class="info text-center" title="SKIN1004 Madagascar Centella Ampoule 30ml"> <strong class="supplier"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004</font></font></strong><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004 Madagascar Centella Ampoule 30ml</font></font></a>')
# Same query as Test 1: exact match on the full class attribute value.
elem = root.xpath('//a[@class="info text-center"]')[0]
# Read the href attribute of the matched element.
url = elem.xpath('./@href')[0]
print(elem, url)

OUTPUT:

<Element a at 0x1ef01509940> https://www.lovemycosmetic.de/skin1004-madagascar-centella-ampoule-30ml-
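
If the same query returns an empty list on the real page, one way to investigate is to print the class values that the parsed document actually contains. A small sketch, assuming a hypothetical SOURCE string standing in for the downloaded HTML (in a real script this would be the body of the HTTP response):

```python
from lxml import html

# Hypothetical stand-in for the downloaded page (an assumption for this
# sketch, not the real site's markup).
SOURCE = (
    '<div>'
    '<a class="text-center info" href="/x">X</a>'
    '<a class="title" href="/y">Y</a>'
    '</div>'
)

page = html.fromstring(SOURCE)

# Collect every distinct class attribute value found on <a> elements.
class_values = sorted(set(page.xpath('//a/@class')))
print(class_values)  # ['text-center info', 'title']

# The exact-match predicate finds nothing here because the order differs.
exact = page.xpath('//a[@class="info text-center"]')
print(len(exact))  # 0
```

Comparing the printed values with the string used in the predicate shows whether the mismatch is in the query or in the HTML that was actually parsed.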
