
I hacked together the code below to try to scrape data from an HTML table into a data frame and then click a button to move to the next page, but it's giving me an error that says 'invalid selector'.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from bs4 import BeautifulSoup
import time
from time import sleep
import pandas as pd


browser = webdriver.Chrome("C:/Utility/chromedriver.exe")
wait = WebDriverWait(browser, 10)

url = 'https://healthdata.gov/dataset/Hospital-Detail-Map/tagw-nk32'
browser.get(url)

for x in range(1, 5950, 13):
    time.sleep(3)  # wait for the page to finish loading
    
    df = pd.read_html(browser.find_element_by_xpath("socrata-table frozen-columns").get_attribute('outerHTML'))[0]
    
    submit_button = browser.find_elements_by_xpath('pager-button-next')[0]
    submit_button.click()

I see the table, but I can't reference it.


Any idea what's wrong here?

I don't think that XPath selector is correct. Wouldn't it be something like //div[@class='socrata-table frozen-columns']? (Commented Dec 29, 2021 at 18:46)

1 Answer


I managed to find the button with find_elements_by_css_selector:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from bs4 import BeautifulSoup
import time
from time import sleep
import pandas as pd

browser = webdriver.Chrome("C:/Utility/chromedriver.exe")
wait = WebDriverWait(browser, 10)

url = 'https://healthdata.gov/dataset/Hospital-Detail-Map/tagw-nk32'
browser.get(url)

for x in range(1, 5950, 13):
    time.sleep(3)  # wait for the page to finish loading

    # use the full XPath from the comments; a bare class string is an invalid selector
    df = pd.read_html(
        browser.find_element_by_xpath(
            "//div[@class='socrata-table frozen-columns']").get_attribute('outerHTML'))[0]

    submit_button = browser.find_elements_by_css_selector('button.pager-button-next')[1]
    submit_button.click()

Sometimes pagination hangs, and submit_button.click() ends with an error:

selenium.common.exceptions.ElementClickInterceptedException: 
Message: element click intercepted: 
Element <button class="pager-button-next">...</button> 
is not clickable at point (182, 637). 
Other element would receive the click: <span class="site-name">...</span>

So consider increasing the timeout. For example, you can use this approach:


from selenium.common.exceptions import WebDriverException

def click_timeout(element, timeout: int = 60):
    # Retry the click once per second until it succeeds or the timeout expires.
    for _ in range(timeout):
        time.sleep(1)
        try:
            element.click()
            return  # stop retrying once the click goes through
        except WebDriverException:
            pass
    element.click()  # final attempt; raises if the element is still not clickable

This way, you click the element as soon as it becomes clickable.
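
For example, in the pagination loop above, the bare submit_button.click() can be swapped for the retry helper. A minimal sketch, reusing the same element lookups as before:

for x in range(1, 5950, 13):
    time.sleep(3)  # wait for the page to finish loading

    df = pd.read_html(
        browser.find_element_by_xpath(
            "//div[@class='socrata-table frozen-columns']").get_attribute('outerHTML'))[0]

    submit_button = browser.find_elements_by_css_selector('button.pager-button-next')[1]
    click_timeout(submit_button)  # retries for up to 60 seconds instead of failing immediately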


1 Comment

I used Paul's recommendation and Eugenij's recommendation, and came up with this:

df = pd.read_html(
    browser.find_element_by_xpath(
        "//div[@class='socrata-table frozen-columns']").get_attribute('outerHTML'))[0]

submit_button = browser.find_elements_by_css_selector('button.pager-button-next')[1]
submit_button.click()

However, after running through the loop multiple times, my data frame has a shape of (0, 64), so nothing is being scraped. Thoughts?
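
One possible cause is that the table's outer HTML is grabbed before its rows have rendered, so pd.read_html sees only the header. A minimal sketch that waits for at least one data row before scraping, reusing the wait object defined above (the 'div.socrata-table tbody tr' row selector is an assumption about the Socrata grid's markup, not confirmed from the page):

# wait until at least one data row is present before reading the table
wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, 'div.socrata-table tbody tr')))  # assumed row selector; verify in devtools

table_html = browser.find_element_by_xpath(
    "//div[@class='socrata-table frozen-columns']").get_attribute('outerHTML')
df = pd.read_html(table_html)[0]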
