Parse a table with BeautifulSoup, Selenium in Python

Question

https://rocketreach.co/horizon-blue-cross-blue-shield-of-new-jersey-email-format_b5c604a3f42e0c54 This is the link I'm trying to get information from. I need to extract the formats listed in the table ("first '_' last", "first_initial last", and so on). If not all of them, then at least the topmost format.

Here's what I have so far:

def search_on_google(key_word, driver):
    driver.get("https://www.google.com/")
    searchBoard = driver.find_element_by_name('q')
    searchBoard.send_keys(key_word + " Rocketreach.co")
    searchBoard.send_keys(Keys.TAB)
    searchBoard.send_keys(Keys.ENTER)
    driver.find_element_by_tag_name("cite").click()
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for link in soup.find_all('meta'):
        content = link.get('content')
        print(content)

Edit:

    for i in range(1):
    driver.find_element_by_tag_name("cite").click()
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.XPATH, "//table/tbody/tr[1]/td[1][not(contains(text(), 'Lorem ipsum...'))]")))

    table_id = driver.find_element(By.TAG_NAME, "tbody")
    rows = table_id.find_elements(By.TAG_NAME, "tr")
    for row in rows:
        tds = row.find_elements(By.TAG_NAME, "td")
        top_format.append(tds[0].text)
        domain.append(tds[1].text)
        print(top_format)
        print(domain)
        break

    return top_format

What format are you talking about? What do you need to extract?
– Karthik, Commented Sep 15, 2020 at 4:55

Arundeep Chohan · Accepted Answer · 2020-09-17 20:26:37Z


There's only one table on this page, and it is not inside any iframe, so you can print all of its information with the following.

# Wait until the first cell no longer shows the placeholder text
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//table/tbody/tr[1]/td[1][not(contains(text(), 'Lorem ipsum...'))]")))
table_id = driver.find_element(By.TAG_NAME, "tbody")
rows = table_id.find_elements(By.TAG_NAME, "tr")
one_urls = []  # collects the text of every cell
for row in rows:
    tds = row.find_elements(By.TAG_NAME, "td")
    for td in tds:
        one_urls.append(td.text)
print(one_urls)

You could add a check before the print, or loop over a range of rows instead.

if tds[0].text == '':
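
One way to write that check, as a minimal sketch: the helper below takes cell texts you have already collected (for example the one_urls list) and returns the first one that is neither blank nor the 'Lorem ipsum...' placeholder seen in the question's output. The name first_real_text is hypothetical and not part of Selenium; it works on plain strings only.

```python
from typing import Iterable, Optional

# Placeholder the page appears to show before the real data loads
# (taken from the output quoted in the question; an assumption otherwise).
PLACEHOLDER = "Lorem ipsum..."


def first_real_text(texts: Iterable[str], placeholder: str = PLACEHOLDER) -> Optional[str]:
    """Return the first text that is neither blank nor the placeholder."""
    for text in texts:
        cleaned = text.strip()
        if cleaned and cleaned != placeholder:
            return cleaned
    return None  # nothing usable was found


# Hypothetical usage with the list built in this answer:
#   top_format = first_real_text(one_urls)
```

A None result would suggest the table had not finished loading when the cells were read.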

I'd also suggest a wait before finding the table, since you're clicking and loading a new page before you get the table.

table_id = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "tbody")))
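
Waiting for tbody alone may return too early if the rows first render with placeholder text. As a rough sketch, WebDriverWait.until accepts any callable that takes the driver and returns a truthy value once the page is ready, so a custom condition can check the cell text directly. The name cells_loaded and the "tbody td" selector are assumptions for illustration; the 'Lorem ipsum...' placeholder comes from the question's output.

```python
def cells_loaded(locator, placeholder="Lorem ipsum..."):
    """Build a wait condition for WebDriverWait.until.

    `locator` is a (by, value) tuple such as (By.CSS_SELECTOR, "tbody td").
    The condition returns the list of cell texts once the first matching
    cell holds real text, and False until then.
    """
    def _condition(driver):
        texts = [cell.text.strip() for cell in driver.find_elements(*locator)]
        # Ready only when at least one cell exists and the first one is
        # neither blank nor still showing the placeholder.
        if texts and texts[0] and texts[0] != placeholder:
            return texts
        return False

    return _condition


# Hypothetical usage (needs the Selenium imports listed in this answer):
#   texts = WebDriverWait(driver, 10).until(
#       cells_loaded((By.CSS_SELECTOR, "tbody td")))
```

If the page re-renders its rows while the wait is polling, a StaleElementReferenceException could surface; passing ignored_exceptions to WebDriverWait is one way to tolerate that.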

Import these

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

edited Sep 17, 2020 at 20:26

answered Sep 15, 2020 at 5:15


16 Comments

confusedcoder Over a year ago

Even after waiting to find the table, it still prints ['Lorem ipsum...', 'Lorem ipsum...', 'Lorem ipsum...', 'Lorem ipsum...', 'Lorem ipsum...',]. How do I overcome this?

Arundeep Chohan Over a year ago

Add some implicit waits; it seems to take a while to load the data.

Arundeep Chohan Over a year ago

Or wait till //table/tbody/tr[1]/td[1]/text()="first '_' last" becomes visible.

confusedcoder Over a year ago

Where would I write that, and what's the exact syntax? I'm sorry, I'm a little confused.

Arundeep Chohan Over a year ago

Right after driver.get(); the page takes a while to load the right data.
