I am trying to scrape the href from the HTML below, but I need the text of the second td with class data to identify which link it is:

<tr>
<td class="data">
    <a target="_new" title="Title" href="https://somesite.com/file_to_scrape.pdf">Scraped Class</a>
<br>
</td>
<td class="data">Text to Identify Above Link</td>
<td class="data">Not relevant text</td>
</tr>

First I collect every element that has the class data:

ls_class = driver.find_elements_by_class_name("data")

but when I loop through:

for clas in ls_class:
   print(clas.text)
   print(clas.get_attribute('href'))

the output is:

Scraped Class
None
Text to Identify Above Link
None
Not relevant text
None

How can I get the nested href when present in a data class?

2 Answers

Instead of collecting every cell with

ls_class = driver.find_elements_by_class_name("data")

you can select the links inside those cells directly:

elements = driver.find_elements_by_xpath("//td[@class='data']//a")
for element in elements:
   print(element.text)
   print(element.get_attribute('href'))

UPD
I think you can get the desired element directly with this XPath, which picks the row that contains the identifying cell and then the link inside that row:

element = driver.find_element_by_xpath("//tr[.//td[@class='data'][text()='Text to Identify Above Link']]//td[@class='data']//a[@href]")
print(element.get_attribute('href'))
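
If the link always sits in the cell immediately before the identifying cell, as the comments on this answer describe, an XPath sibling axis can express that relationship more tightly. The following is only a sketch: the helper name link_xpath_for_label and its cell_class parameter are made up for illustration, and the final driver call is left as a comment because it needs a live browser session.

```python
def link_xpath_for_label(label, cell_class="data"):
    """Build an XPath for the <a> in the cell just before the cell whose text is `label`.

    Hypothetical helper, not part of Selenium.
    """
    # XPath 1.0 has no escape character, so pick a quote style the label does not use.
    if "'" not in label:
        quoted = "'" + label + "'"
    elif '"' not in label:
        quoted = '"' + label + '"'
    else:
        raise ValueError("label mixes both quote styles; build a concat() expression instead")
    return (
        f"//td[@class='{cell_class}'][normalize-space()={quoted}]"
        # [1] on a reverse axis means the nearest preceding sibling
        f"/preceding-sibling::td[@class='{cell_class}'][1]//a[@href]"
    )


xpath = link_xpath_for_label("Text to Identify Above Link")
print(xpath)

# With a live driver (not run here), using the same call style as above:
# href = driver.find_element_by_xpath(xpath).get_attribute('href')
```

normalize-space() is used instead of text() so that stray whitespace around the cell text does not break the match.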

13 Comments

When I do this, it doesn't return the second cell, which holds the text I need to identify the preceding link. I only get elements that have an href.
One moment, maybe I misunderstood you. Is all you want to get the https://somesite.com/file_to_scrape.pdf value (which is unknown, so we have to fetch it), where the a element sits inside a td whose next sibling td contains the known text Text to Identify Above Link? Correct?
Yes, I need both the href from the first cell and the text Text to Identify Above Link from the second cell, otherwise I don't know what the link is for. I tried find_elements_by_xpath("//td[@class='data']") but that just gives me the same output I originally had.
In the example HTML you provided, the second td, with the text Text to Identify Above Link, has no a with an href inside it.
That is because there is no href in the second data cell. I need to get the first and second data cells consecutively in a list of elements; then, once I have identified the link from the text in the second cell, I want to extract the href from the first cell, if that makes sense.

I got it to work using a solution posted here:

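from selenium.common.exceptions import NoSuchElementException

ls_class = driver.find_elements_by_xpath("//td[@class='data']")

for clas in ls_class:
    print(clas.text)
    try:
        # Search for a link inside this cell only, not the whole page
        print(clas.find_element_by_css_selector('a').get_attribute('href'))
    except NoSuchElementException:
        # This cell has no <a> child
        print("No Link")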

Now my output is:

Scraped Class
https://somesite.com/file_to_scrape.pdf
Text to Identify Above Link
No Link
Not relevant text
No Link
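
To go from that flat output to "the href that belongs to a given label", the cells can be grouped per row and each cell compared with the one before it. The sketch below shows the pairing logic offline with Python's standard html.parser instead of Selenium, so it runs without a browser. The names RowCollector and href_before_label are invented for this example, and it assumes simple, non-nested tables like the one in the question.

```python
from html.parser import HTMLParser

HTML = """
<tr>
<td class="data">
    <a target="_new" title="Title" href="https://somesite.com/file_to_scrape.pdf">Scraped Class</a>
<br>
</td>
<td class="data">Text to Identify Above Link</td>
<td class="data">Not relevant text</td>
</tr>
"""


class RowCollector(HTMLParser):
    """Collect, per <tr>, the text and first href of each td with class "data"."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell dicts per <tr>
        self._cell = None  # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and attrs.get("class") == "data" and self.rows:
            self._cell = {"text": "", "href": None}
            self.rows[-1].append(self._cell)
        elif tag == "a" and self._cell is not None and self._cell["href"] is None:
            self._cell["href"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "td":
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data


def href_before_label(rows, label):
    """Return the href of the cell directly before the cell whose text equals `label`."""
    for row in rows:
        for prev, cell in zip(row, row[1:]):
            if cell["text"].strip() == label:
                return prev["href"]
    return None


parser = RowCollector()
parser.feed(HTML)
print(href_before_label(parser.rows, "Text to Identify Above Link"))
# prints https://somesite.com/file_to_scrape.pdf
```

With Selenium the same idea would mean looping over the tr elements and reading the td cells of each row, rather than looping over one flat list of every cell on the page.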
