Webscraping with Selenium in Python

Question

I am trying to scrape the list of DAOs from messari.io, but I am having trouble because I get the following errors:

DeprecationWarning: executable_path has been deprecated, please pass in a Service object


driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

DevTools listening on ws://127.0.0.1:56691/devtools/browser/b4609671-5e6e-4d25-b09e-4116b3dde4bf
[0525/100030.252:INFO:CONSOLE(1)] "enabling sentry error tracker", source: https://messari.io/static/js/main.977a4794.chunk.js (1)
[0525/100030.951:INFO:CONSOLE(2)] "Unable to refresh token: Login required", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
[0525/100031.065:INFO:CONSOLE(2)] "


88b           d88                                                            88
888b         d888                                                            ""
88'8b       d8'88
88 '8b     d8' 88   ,adPPYba,  ,adPPYba,  ,adPPYba,  ,adPPYYba,  8b,dPPYba,  88
88  '8b   d8'  88  a8P_____88  I8[    ""  I8[    ""  ""     'Y8  88P'   "Y8  88
88   '8b d8'   88  8PP"""""""   '"Y8ba,    '"Y8ba,   ,adPPPPP88  88          88
88    '888'    88  "8b,   ,aa  aa    ]8I  aa    ]8I  88,    ,88  88          88
88     '8'     88   '"Ybbd8"'  '"YbbdP"'  '"YbbdP"'  '"8bbdP"Y8  88          88


", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
[0525/100031.069:INFO:CONSOLE(2)] "Interested in a CHALLENGE? Check out: https://messari.io/quiz", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
Traceback (most recent call last):
  File "c:/Users/Student/webScrape/scraper.py", line 21, in <module>
    matches = WebDriverWait(driver, 10).until(
  File "C:\Users\Student\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\support\wait.py", line 89, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
Backtrace:
        Ordinal0 [0x0096B8F3+2406643]
        Ordinal0 [0x008FAF31+1945393]
        Ordinal0 [0x007EC748+837448]
        Ordinal0 [0x008192E0+1020640]
        Ordinal0 [0x0081957B+1021307]
        Ordinal0 [0x00846372+1205106]
        Ordinal0 [0x008342C4+1131204]
        Ordinal0 [0x00844682+1197698]
        Ordinal0 [0x00834096+1130646]
        Ordinal0 [0x0080E636+976438]
        Ordinal0 [0x0080F546+980294]
        GetHandleVerifier [0x00BD9612+2498066]
        GetHandleVerifier [0x00BCC920+2445600]
        GetHandleVerifier [0x00A04F2A+579370]
        GetHandleVerifier [0x00A03D36+574774]
        Ordinal0 [0x00901C0B+1973259]
        Ordinal0 [0x00906688+1992328]
        Ordinal0 [0x00906775+1992565]
        Ordinal0 [0x0090F8D1+2029777]
        BaseThreadInitThunk [0x777BFA29+25]
        RtlGetAppContainerNamedObjectPath [0x77B77A7E+286]
        RtlGetAppContainerNamedObjectPath [0x77B77A4E+238]

I know there is an API for messari.io, but I am almost certain it is only for their assets and not their list of DAOs. I tried using Selenium since it is a dynamic page but I am still having trouble. Here is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests

url = 'https://messari.io/governor/daos'

DRIVER_PATH = 'PATH_TO_DRIVER_ON_MY_PC'
options = Options()
options.headless = True
options.add_argument("--window-size=1920, 1200")

# s = Service('PATH_TO_DRIVER_ON_MY_PC')
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get('https://messari.io/governor/daos')

try:
    matches = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "td")))
    # for match in matches:
    #     print(match.text)

finally:
    driver.quit()

Update: I fixed the executable_path warning, but I am still getting the same TimeoutException. When I run it without headless mode, I also get the following message:

DevTools listening on ws://127.0.0.1:57773/devtools/browser/4450b78d-3a9f-401a-b39c-2c716ecad924
[9628:20616:0525/102300.840:ERROR:device_event_log_impl.cc(214)] [10:23:00.840] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
[9628:20616:0525/102300.841:ERROR:device_event_log_impl.cc(214)] [10:23:00.841] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)

Based on similar questions, I assume this part is a hardware message that I shouldn't worry about, because one of the two lines disappeared when I unplugged my mouse.

It seems it can't find this element. First, you could display the HTML (driver.page_source) to check manually whether the element is there. And if the element is inside a <frame>, you have to use driver.switch_to before you search for it.
– furas, Commented May 25, 2022 at 14:31
I checked the source code in DevTools and I don't see any <td> in it. The page uses only <div> to create something that looks like a table. What do you really want to get?
– furas, Commented May 25, 2022 at 14:34
I want to be able to get each element in the table, except the last column. For example, in the first row: the name: Fei, the type: protocol, the tags: defi.
– Alex, Commented May 25, 2022 at 14:36
As I said, it DOESN'T use <td> to display the list but <div>, and it keeps Fei in <h4>, at least in my Firefox on a desktop system.
– furas, Commented May 25, 2022 at 14:37
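
The check suggested in the comments above (looking at driver.page_source to see which tags the rendered page really contains) could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name count_tags is made up here, and it uses Python's standard html.parser so it runs on any HTML string, including the one returned by driver.page_source.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each opening tag appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case
        self.counts[tag] += 1


def count_tags(html):
    """Hypothetical helper: return a Counter mapping tag name -> occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

With a live Selenium session, one might call count_tags(driver.page_source) after the page has loaded and compare the counts for "td", "div" and "h4". A count of zero for "td" would explain why a wait on that tag times out.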

furas · Accepted Answer · 2022-05-25 16:08:39Z

2

This page doesn't use <td> to display the list of DAOs.
It uses <div> (with CSS) to display it in a table-like layout.

And it keeps the name of each DAO in <h4>.

At least it uses <div> and <h4> in my Firefox on a laptop with Linux.

Full working code (tested on Linux Mint, Python 3.8, Selenium 4.x, Chrome 101.x)

I used the webdriver_manager module, so it automatically downloads a fresh driver whenever Linux installs a newer version of Chrome.

I have to use find_elements() (with an s at the end of elements) or presence_of_all_elements_located() to get all <h4>.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from webdriver_manager.chrome import ChromeDriverManager

url = 'https://messari.io/governor/daos'

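options = Options()
# "--headless" works across Selenium 4 releases; the options.headless setter was deprecated in later ones
options.add_argument("--headless")
# no space after the comma, otherwise Chrome may not parse the size correctly
options.add_argument("--window-size=1920,1200")

# webdriver_manager downloads a driver that matches the installed Chrome
driver = webdriver.Chrome(options=options, service=Service(ChromeDriverManager().install()))

driver.get(url)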

try:
    matches = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.TAG_NAME, "h4")))
    
    #matches = driver.find_elements(By.TAG_NAME, "h4")
    
    for match in matches:
        if match.text:
            print(match.text)
finally:
    driver.quit()

Result:

Fei
Rook
Cosmos
Stargate Finance
Aave
Treasure DAO
DODO
Radicle
Goldfinch
Merit Circle
EPNS
Perpetual Protocol
Gitcoin
SuperRare
Indexed
Doodles
Rome DAO
Badger
Paraswap
Unlock
Terra
Shapeshift
Lobis
Pool Together
The Graph
Yearn Finance
Ampleforth
Alpaca Finance
Balancer
Gro Protocol
Sismo DAO
BeethovenX
ENS
Lido
Alchemist

EDIT:

To get all values you may have to scroll the page, so that JavaScript adds new items.

There are answers that use a while-loop with execute_script(), which runs JavaScript code to scroll to the bottom and get the current page height. If the height is different from the height before the scroll, you have to scroll again. If the height is the same, you have reached the end of the page and can now get all items.
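
The scroll loop described above could look roughly like the sketch below. It is not taken from the thread: the function name scroll_to_bottom and the pause and max_rounds parameters are assumptions made for illustration, and it assumes the page grows document.body.scrollHeight as items load. If the list scrolls inside an inner container, or removes off-screen rows, this approach would need adjusting.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until the page height stops growing.

    `driver` only needs an execute_script() method, as a Selenium WebDriver has.
    Returns the number of scrolls that were performed.
    `pause` and `max_rounds` are illustrative defaults, not tuned values.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        # jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # give the page's JavaScript time to append new items
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # height unchanged: nothing new was loaded, so stop
            return round_number
        last_height = new_height
    # safety limit reached; the page may still have more items
    return max_rounds
```

In the answer's code this would be called after driver.get(...) and before collecting the <h4> elements with find_elements(). The max_rounds limit is there so that a page which never stops growing cannot hang the script.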

edited May 25, 2022 at 16:08

answered May 25, 2022 at 14:39

furas


9 Comments

Alex Over a year ago

Okay, so based on that, how would I change my code to work the way I want? I tried just changing the tag from "td" to "h4" and printing matches.text, but it only returned "Governor". When I changed presence_of_element_located to presence_of_all_elements_located, it only returned "Governor" as well.
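
One possible explanation for seeing only "Governor" is timing: presence_of_all_elements_located is satisfied as soon as at least one matching element exists, so the wait can return while only the page heading is in the DOM and the DAO rows have not been added yet. A custom wait condition that requires a minimum number of matches is one way to sketch around that. The name at_least_n_elements and the threshold are hypothetical. The only Selenium behaviour relied on is that WebDriverWait.until() accepts any callable that takes the driver and keeps polling while it returns a falsy value.

```python
def at_least_n_elements(by, value, n):
    """Build a wait condition that passes once `n` or more elements match.

    Hypothetical helper: the returned callable takes a driver, as
    WebDriverWait.until() expects, and returns the list of elements
    when there are enough of them, or False to keep waiting.
    """

    def _condition(driver):
        elements = driver.find_elements(by, value)
        return elements if len(elements) >= n else False

    return _condition
```

With a real driver this might be used as WebDriverWait(driver, 10).until(at_least_n_elements(By.TAG_NAME, "h4", 2)), where 2 is only a guess at "the heading plus at least one DAO".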

furas Over a year ago

I added full working code.

Alex Over a year ago

That works, thank you so much! What would I need to do differently to get all 855 DAOs versus just the 70 DAOs that this returns?

furas Over a year ago

The code would have to behave like a real human: scroll to the bottom of the page, wait, and check whether the page is bigger than it was after the previous scroll. If it is bigger, scroll again; if it is not, you have reached the end of the page. There are questions that show how to use a while-loop with JavaScript (execute_script()) for this.

Alex Over a year ago

How did you figure out that it keeps the names of DAOs in h4? When I do inspect, it says the names are a <span>. I am trying to figure out how to get the Type and Tags of each DAO as well.


Akzy · Answer · answered May 25, 2022 at 14:16 · edited by General Grievance 2022-05-25 20:05:48Z

-1

In Selenium 4 the executable_path keyword is deprecated. Instead, you have to pass an instance of the Service() class, which can be combined with ChromeDriverManager().install() as shown below:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://www.google.com")

edited May 25, 2022 at 20:05

General Grievance


answered May 25, 2022 at 14:16

Akzy


4 Comments

Alex Over a year ago

That fixed my first error message, but I am still getting a TimeoutException similar to the one in my original question.

furas Over a year ago

This is NOT an error but only a warning, and the OP can still use the old method. The new method doesn't resolve the main problem.

Akzy Over a year ago

I can see you updated the question after I posted my answer. If this fixed your original issue, could you please accept the answer and create a separate question for the other issue?

Alex Over a year ago

This only fixed the warning I was getting; it has not fixed the actual issue I am running into.
