
Sorry for the vague title; the problem is easier to describe once you visit the following website:

There is text on the right that says "See all". Once you click it, a list of links to various forks pops up. I am trying to scrape the hyperlinks for those forks.

One problem is that the scraper scrapes not only the links to the forks but also the links to the profiles. The site doesn't use a specific class or ID for those links, so I've edited my script to work out which results are the right ones and which are not. That part works. However, the script scrapes only a few of the links and skips the others. This confused me, because at first I thought it was caused by the elements not being visible to Selenium, since the list scrolls. That doesn't seem to be the issue, though, since some of the links that are skipped are plainly visible. The script scrapes only the first 5 links and completely skips the rest.

I am now unsure what to do, since there is no error or warning about any possible issue with the code itself.

This is a short part of the code that scrapes the links.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

driver.get(url)

# Open the "See all" list of forks.
wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "button.see-all-forks"))).click()
fork_count = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "span.jsx-3602798114"))).text
forks = wait.until(ec.presence_of_all_elements_located((By.CSS_SELECTOR, "a.jsx-2470659356")))

# j counts through the results; only results where j == 1 are fork links.
j = 1
for i, fork in enumerate(forks):
    if j == 1:
        forks[i] = fork.get_attribute("href")
        print(forks[i])
    if j == 3:
        j = 1
    else:
        j += 1

In this case the "url" variable is the link I provided above. The loop then skips 3 results after each one it keeps, because every 4th one is the right one. I tried filtering the results with an XPath "contains" function, but the names vary because the users choose them themselves, so to my understanding this counter is the only way to filter the results.

This is the output that I get.

After these, no further results are printed and the program terminates without errors. What is happening here, and what have I missed? I am confused about why Selenium scrapes only five results before the script ends.

Edit note: my code explained

I've set up the if statements to check every 4th result, since that one is the right one (the first result is also a right one). While "j != 3", 1 is added to "j"; once "j == 3" (the next right result appears), it is reset to 1, and the "j == 1" branch runs and prints the right result. So the right result always falls on "j == 1".
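The counter bookkeeping described above can also be expressed with list slicing. A minimal sketch, assuming (as the note says) that every 4th entry starting with the first is a fork link; the links list here is made up for illustration, while in the real script the entries would come from fork.get_attribute("href"):

```python
# Hypothetical hrefs standing in for the scraped elements.
links = [f"https://replit.com/link-{i}" for i in range(12)]

# Keep every 4th entry, starting with the first.
fork_links = links[::4]
print(fork_links)  # entries 0, 4 and 8
```
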

1 Answer


The problem here is that all the expected conditions you are using are satisfied as soon as at least one matching element is present.
So

forks = wait.until(ec.presence_of_all_elements_located((By.CSS_SELECTOR, "a.jsx-2470659356")))

does not actually catch all the elements, as its name suggests; you can never know how many it returns, only that it is at least one.
That's why your forks list is so short.
The simplest way to overcome this is to add a short hardcoded sleep after the wait.until(ec.presence_of_all_elements_located(...)) call, and only after that get the list of elements.
See this post for more details.

In Java there is an expected condition numberOfElementsToBeMoreThan, so there it could be used with a condition of more than 95, etc. In Python, however, the list of built-in expected conditions is much shorter and there is no such option.
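Since WebDriverWait.until accepts any callable that takes the driver and returns a truthy value, the missing condition can be written by hand. A minimal sketch: the class name, the locator, and the threshold of 95 are illustrative, and StubDriver merely stands in for a real WebDriver so the logic can be exercised without a browser:

```python
class number_of_elements_more_than:
    """Custom expected condition: wait until more than `count`
    elements match `locator`, then return the full list."""

    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        # Returning False tells WebDriverWait to keep polling.
        return elements if len(elements) > self.count else False


# Stand-in for a real WebDriver, used here only to demonstrate the condition.
class StubDriver:
    def __init__(self, n):
        self.n = n

    def find_elements(self, by, value):
        return [f"element-{i}" for i in range(self.n)]


cond = number_of_elements_more_than(("css selector", "a.jsx-2470659356"), 95)
print(cond(StubDriver(10)))        # False: not enough elements yet
print(len(cond(StubDriver(100))))  # 100: condition satisfied
```

With a real driver the usage would be, e.g., forks = wait.until(number_of_elements_more_than((By.CSS_SELECTOR, "a.jsx-2470659356"), 95)), which polls until the list is long enough instead of relying on a fixed sleep.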


4 Comments

I've edited my code and added sleep(5) before the wait for the elements. This worked, and I got a huge list of results. However, now roughly the first 10-15 results have duplicates; see this output. For example, this link: replit.com/@MAORtk123/Customizable-Discord-Bot-14 gets printed twice. After the first 10-15 results all links are unique and appear as intended. Any idea what might be causing this now that the sleep is present? Thanks for your help, though.
I think you are using the wrong locator. Try the following XPath: //div[@aria-modal='true']//*[@class='jsx-2470659356 fork-card']/a or, if you prefer a CSS selector: [aria-modal='true'] div.jsx-2470659356.fork-card>a. This gives exactly 100 results. BTW, a delay of 1 second will be more than enough; there is no need for a 5-second sleep.
Your XPath selector worked perfectly, without even needing my filter. I am surprised I couldn't come up with it myself... Anyway, the results are now perfect and don't contain any duplicates, so thank you once again for helping me with this! I've marked your answer as the accepted one.
I'm happy I could help you! Thanks!
