
Very new to Python and Selenium, looking to scrape a few data points. I'm struggling in three areas:

  1. I don't understand how to loop through multiple URLs properly
  2. I can't figure out why the script is iterating twice over each URL
  3. I can't figure out why it's only outputting the data for the second URL

Much thanks for taking a look!

Here's my current script:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

urls = [
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.crutchfield.com/%2F&tab=mobile',
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.lastpass.com%2F&tab=mobile'
]

driver = webdriver.Chrome(executable_path='/Library/Frameworks/Python.framework/Versions/3.9/bin/chromedriver')

for url in urls:
    for page in range(0, 1):
        driver.get(url)
        wait = WebDriverWait(driver, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'origin-field-data')))
df = pd.DataFrame(columns = ['Title', 'Core Web Vitals', 'FCP', 'FID', 'CLS', 'TTI', 'TBT', 'Total Score'])
company = driver.find_elements_by_class_name("audited-url__link")

data = []

for i in company:
    data.append(i.get_attribute('href'))

for x in data:
    #Get URL name
    title = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[1]/div/div[2]/h1/a')
    co_name = title.text

    #Get Core Web Vitals text pass/fail
    cwv = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[1]/span[2]')
    core_web = cwv.text

    #Get FCP
    fcp = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div')
    first_content = fcp.text

    #Get FID
    fid = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[3]/div[1]/div')
    first_input = fid.text

    #Get CLS
    cls = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[4]/div[1]/div')
    layout_shift = cls.text

    #Get TTI
    tti = driver.find_element_by_xpath('//*[@id="interactive"]/div/div[1]')
    time_interactive = tti.text

    #Get TBT
    tbt = driver.find_element_by_xpath('//*[@id="total-blocking-time"]/div/div[1]')
    total_block = tbt.text

    #Get Total Score
    total_score = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[1]/div/div[1]/a/div[2]')
    score = total_score.text

    #Adding all columns to dataframe
    df.loc[len(df)] = [co_name,core_web,first_content,first_input,layout_shift,time_interactive,total_block,score]
        
driver.close()

#df.to_csv('Double Page Speed Test 9-10.csv')
print(df)
  • Why this line: for page in range(0, 1)? Commented Sep 10, 2021 at 21:44
  • Yeah, I'm not sure @pcalkins. I used it because I saw it work for someone else trying to achieve the same goal. I've now commented it out and the script runs fine, but it still prints duplicate results for the second URL (and no result for the first URL). Commented Sep 12, 2021 at 11:20

1 Answer


Q1: I don't understand how to loop through multiple URLs properly?

Ans: for url in urls: — that structure is correct.

Q2. I can't figure out why the script is iterating twice over each URL

Ans: You have for page in range(0, 1): — but note that range(0, 1) yields only the single value 0, so that inner loop actually runs its body once per URL. The duplication comes from the element locator instead (see Update 1 below).
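As a quick check of what that inner loop actually does:

```python
# range(start, stop) excludes stop, so range(0, 1) yields just one value
print(list(range(0, 1)))  # [0] -> the loop body runs once per URL
print(list(range(0, 2)))  # [0, 1] -> this would be needed to run it twice
```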

Update 1:

I did not run your entire code with the DataFrame. Also, sometimes one of the pages does not show the number and href. But when I run the code below,

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(50)
wait = WebDriverWait(driver, 20)
urls = [
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.crutchfield.com/%2F&tab=mobile',
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.lastpass.com%2F&tab=mobile'
]

data = []

for url in urls:
    driver.get(url)
    wait = WebDriverWait(driver, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'origin-field-data')))
    company = driver.find_elements_by_css_selector("h1.audited-url a")
    for i in company:
        data.append(i.get_attribute('href'))

print(data)

this outputs:

['https://www.crutchfield.com//', 'https://www.lastpass.com/', 'https://www.lastpass.com/']

which makes sense, because the element locator we used matches 1 element on the first page and 2 elements on the second page.
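The behavior in the question can be reproduced without a browser. In the sketch below, FakeDriver is a made-up stand-in for webdriver.Chrome (its hard-coded page contents mirror the output above); it shows why scraping after the loop only ever sees the last-loaded page, while scraping inside the loop sees every page:

```python
class FakeDriver:
    """Stand-in for webdriver.Chrome: 'loads' a page and lists its link hrefs."""
    PAGES = {
        'page1': ['https://www.crutchfield.com//'],
        'page2': ['https://www.lastpass.com/', 'https://www.lastpass.com/'],
    }

    def __init__(self):
        self.current = None

    def get(self, url):
        self.current = url

    def find_elements(self):  # plays the role of find_elements_by_*
        return self.PAGES[self.current]


driver = FakeDriver()
urls = ['page1', 'page2']

# Shape of the question's script: scrape AFTER the loop -> only the last page
for url in urls:
    driver.get(url)
after_loop = list(driver.find_elements())

# Fixed shape: scrape INSIDE the loop -> one pass per page
rows = []
for url in urls:
    driver.get(url)
    rows.extend(driver.find_elements())

print(after_loop)  # ['https://www.lastpass.com/', 'https://www.lastpass.com/']
print(rows)        # hrefs from both pages
```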


2 Comments

I'm afraid I'm too ignorant to understand how this helps me. I've adjusted the script to comment out only for page in range(0, 1): and fixed the subsequent indentation. The script runs fine, but I'm still getting duplicated results, and only for the second URL; no results are printed for the first URL.
@Micadeli: See the update above.
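To get one row per URL, every find_element call has to run while that URL's page is still loaded, i.e. inside the for url in urls: loop. One way to structure that is sketched below — not tested against the live PageSpeed page; scrape_rows, wait_for, and the fields parameter are names made up for this sketch, and driver.find_element(by, value) assumes Selenium 4 syntax, which replaced the find_element_by_* helpers:

```python
def scrape_rows(driver, urls, wait_for, fields):
    """Visit each URL and read every field while that page is loaded.

    driver   : any object with .get(url) and .find_element(by, value)
    wait_for : callable(driver) that blocks until the page has rendered
    fields   : list of (column_name, by, value) locator triples
    """
    rows = []
    for url in urls:
        driver.get(url)   # navigate first...
        wait_for(driver)  # ...wait for the metrics to render...
        # ...then read every field for THIS page before the next get()
        rows.append({name: driver.find_element(by, value).text
                     for name, by, value in fields})
    return rows
```

With real Selenium this would be called roughly as scrape_rows(webdriver.Chrome(), urls, lambda d: WebDriverWait(d, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'origin-field-data'))), fields) with fields listing the XPaths from the question, and the returned list of dicts feeds straight into pd.DataFrame(rows).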
