
https://rocketreach.co/horizon-blue-cross-blue-shield-of-new-jersey-email-format_b5c604a3f42e0c54 This is the page I'm trying to extract information from. I need the formats listed in the table ("first '_' last", "first_initial last", and so on). If not all of them, then at least the topmost format.

Here's what I have so far:

def search_on_google(key_word, driver):
    driver.get("https://www.google.com/")
    searchBoard = driver.find_element_by_name('q')
    searchBoard.send_keys(key_word + " Rocketreach.co")
    searchBoard.send_keys(Keys.TAB)
    searchBoard.send_keys(Keys.ENTER)
    driver.find_element_by_tag_name("cite").click()
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for link in soup.find_all('meta'):
        content = link.get('content')
        print(content)

Edit:

    for i in range(1):
    driver.find_element_by_tag_name("cite").click()
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.XPATH, "//table/tbody/tr[1]/td[1][not(contains(text(), 'Lorem ipsum...'))]")))

    table_id = driver.find_element(By.TAG_NAME, "tbody")
    rows = table_id.find_elements(By.TAG_NAME, "tr")
    for row in rows:
        tds = row.find_elements(By.TAG_NAME, "td")
        top_format.append(tds[0].text)
        domain.append(tds[1].text)
        print(top_format)
        print(domain)
        break

    return top_format
  • What format are you talking about? What do you need to extract? Commented Sep 15, 2020 at 4:55

1 Answer


There's only one table on this page, and it is not inside an iframe, so you can print all of its contents with the following:

# Wait until the first cell no longer contains the 'Lorem ipsum...' text.
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//table/tbody/tr[1]/td[1][not(contains(text(), 'Lorem ipsum...'))]")))
table_id = driver.find_element(By.TAG_NAME, "tbody")
rows = table_id.find_elements(By.TAG_NAME, "tr")
one_urls = []  # collects the text of every cell, row by row
for row in rows:
    tds = row.find_elements(By.TAG_NAME, "td")
    for td in tds:
        one_urls.append(td.text)
print(one_urls)

You could check each row before printing (for example, skip rows whose first cell is empty), or limit the loop with a range.

if tds[0].text == '':
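As a rough illustration of both options, the sketch below works on plain lists of cell text rather than on WebElements. `keep_rows` and its `limit` parameter are hypothetical names, not part of Selenium, and the sample data is made up. It skips rows whose first cell is empty and optionally stops after the first few rows.

```python
def keep_rows(rows, limit=None):
    """Drop rows with an empty first cell; keep at most `limit` rows if given."""
    kept = [row for row in rows if row and row[0].strip() != ""]
    return kept if limit is None else kept[:limit]


# Sample data invented for the example. With Selenium you would build it as
# [[td.text for td in row.find_elements(By.TAG_NAME, "td")] for row in rows].
sample = [["", ""], ["first '_' last", "x"], ["first_initial last", "y"]]
print(keep_rows(sample))           # the two rows with a non-empty first cell
print(keep_rows(sample, limit=1))  # only the topmost non-empty row
```

Working on extracted text keeps the filtering independent of the browser, so it can be checked without loading the page.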

I'd also suggest waiting before you look up the table, since you're clicking and loading a new page right before reading it.

table_id = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "tbody")))

You'll need these imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC
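If the cells still show the 'Lorem ipsum...' text that the XPath above checks for, another option is a custom wait condition, since `WebDriverWait.until` accepts any callable that takes the driver and returns something truthy once the page is ready. The following is only a minimal sketch: the helper names (`read_rows`, `table_is_loaded`, `wait_for_formats`) are invented for illustration, and it assumes the page has a single `tbody` and that the placeholder text is exactly 'Lorem ipsum...'.

```python
PLACEHOLDER = "Lorem ipsum..."  # assumed text shown before the real data arrives
CSS = "css selector"            # same string as selenium's By.CSS_SELECTOR


def read_rows(driver):
    """Return the table body as a list of rows, each a list of cell texts."""
    rows = driver.find_elements(CSS, "tbody tr")
    return [[td.text for td in row.find_elements(CSS, "td")] for row in rows]


def table_is_loaded(driver):
    """Wait condition: return the rows once no cell is a placeholder, else False."""
    rows = read_rows(driver)
    cells = [cell for row in rows for cell in row]
    if cells and not any(PLACEHOLDER in cell for cell in cells):
        return rows
    return False


def wait_for_formats(driver, timeout=10):
    """Block until real data has replaced the placeholders, then return it."""
    # Imported here so the helpers above can be tried without a browser session.
    from selenium.webdriver.support.ui import WebDriverWait
    return WebDriverWait(driver, timeout).until(table_is_loaded)
```

After the click that opens the page, `rows = wait_for_formats(driver)` would return the cell texts, and `rows[0][0]` would be the first cell of the first row, which is what the question's edit stores as the top format. If the data never arrives within `timeout` seconds, `until` raises a `TimeoutException`. If the table is re-rendered while it is being read, a `StaleElementReferenceException` is possible; passing it through `WebDriverWait`'s `ignored_exceptions` argument is one way to retry in that case.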

16 Comments

Even after waiting to find the table, it still prints ['Lorem ipsum...', 'Lorem ipsum...', 'Lorem ipsum...', 'Lorem ipsum...', 'Lorem ipsum...',]. How do I overcome this?
Add some implicit waits; the data seems to take a while to load.
Or wait until //table/tbody/tr[1]/td[1]/text()="first '_' last" becomes visible.
Where would I write that, and what's the exact syntax? I'm sorry, I'm a little confused.
After you call driver.get(), the page takes a while to load the right data.