
Very new to Python and Selenium, looking to scrape a few data points. I'm struggling in three areas:

  1. I don't understand how to loop through multiple URLs properly
  2. I can't figure out why the script is iterating twice over each URL
  3. I can't figure out why it's only outputting the data for the second URL

Much thanks for taking a look!

Here's my current script:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

urls = [
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.crutchfield.com/%2F&tab=mobile',
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.lastpass.com%2F&tab=mobile'
]

driver = webdriver.Chrome(executable_path='/Library/Frameworks/Python.framework/Versions/3.9/bin/chromedriver')

for url in urls:
    for page in range(0, 1):
        driver.get(url)
        wait = WebDriverWait(driver, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'origin-field-data')))
df = pd.DataFrame(columns = ['Title', 'Core Web Vitals', 'FCP', 'FID', 'CLS', 'TTI', 'TBT', 'Total Score'])
company = driver.find_elements_by_class_name("audited-url__link")

data = []

for i in company:
    data.append(i.get_attribute('href'))

for x in data:
    #Get URL name
    title = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[1]/div/div[2]/h1/a')
    co_name = title.text

    #Get Core Web Vitals text pass/fail
    cwv = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[1]/span[2]')
    core_web = cwv.text

    #Get FCP
    fcp = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div')
    first_content = fcp.text

    #Get FID
    fid = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[3]/div[1]/div')
    first_input = fid.text

    #Get CLS
    cls = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[4]/div[1]/div')
    layout_shift = cls.text

    #Get TTI
    tti = driver.find_element_by_xpath('//*[@id="interactive"]/div/div[1]')
    time_interactive = tti.text

    #Get TBT
    tbt = driver.find_element_by_xpath('//*[@id="total-blocking-time"]/div/div[1]')
    total_block = tbt.text

    #Get Total Score
    total_score = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[1]/div/div[1]/a/div[2]')
    score = total_score.text

    #Adding all columns to dataframe
    df.loc[len(df)] = [co_name,core_web,first_content,first_input,layout_shift,time_interactive,total_block,score]
        
driver.close()

#df.to_csv('Double Page Speed Test 9-10.csv')
print(df)
  • Why this line: for page in range(0, 1)? Commented Sep 10, 2021 at 21:44
  • Yeah, I'm not sure @pcalkins. I used it because I saw it work for someone else trying to achieve the same goal. I've now commented it out and the script runs fine, but it still prints duplicate results for the second URL (and no result for the first URL). Commented Sep 12, 2021 at 11:20

1 Answer


Q1: I don't understand how to loop through multiple URLs properly?

Ans: for url in urls: — that structure is correct.

Q2. I can't figure out why the script is iterating twice over each URL

Ans: You have for page in range(0, 1): — but note that range(0, 1) yields only the single value 0, so that inner loop actually runs its body once per URL. The duplication comes from the element locator instead (see Update 1 below).
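As a quick check of what that inner loop actually does:

```python
# range(start, stop) excludes stop, so range(0, 1) yields just one value
print(list(range(0, 1)))  # [0] -> the loop body runs once per URL
print(list(range(0, 2)))  # [0, 1] -> this would be needed to run it twice
```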

Update 1:

I did not run your entire code with the DataFrame. Also, sometimes one of the pages does not show the number and href. But when I run the code below,

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(50)
wait = WebDriverWait(driver, 20)
urls = [
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.crutchfield.com/%2F&tab=mobile',
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.lastpass.com%2F&tab=mobile'
]

data = []

for url in urls:
    driver.get(url)
    wait = WebDriverWait(driver, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'origin-field-data')))
    company = driver.find_elements_by_css_selector("h1.audited-url a")
    for i in company:
        data.append(i.get_attribute('href'))

print(data)

this outputs:

['https://www.crutchfield.com//', 'https://www.lastpass.com/', 'https://www.lastpass.com/']

which makes sense, because the element locator we used matches 1 element on the first page and 2 elements on the second page.
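The behavior in the question can be reproduced without a browser. In the sketch below, FakeDriver is a made-up stand-in for webdriver.Chrome (its hard-coded page contents mirror the output above); it shows why scraping after the loop only ever sees the last-loaded page, while scraping inside the loop sees every page:

```python
class FakeDriver:
    """Stand-in for webdriver.Chrome: 'loads' a page and lists its link hrefs."""
    PAGES = {
        'page1': ['https://www.crutchfield.com//'],
        'page2': ['https://www.lastpass.com/', 'https://www.lastpass.com/'],
    }

    def __init__(self):
        self.current = None

    def get(self, url):
        self.current = url

    def find_elements(self):  # plays the role of find_elements_by_*
        return self.PAGES[self.current]


driver = FakeDriver()
urls = ['page1', 'page2']

# Shape of the question's script: scrape AFTER the loop -> only the last page
for url in urls:
    driver.get(url)
after_loop = list(driver.find_elements())

# Fixed shape: scrape INSIDE the loop -> one pass per page
rows = []
for url in urls:
    driver.get(url)
    rows.extend(driver.find_elements())

print(after_loop)  # ['https://www.lastpass.com/', 'https://www.lastpass.com/']
print(rows)        # hrefs from both pages
```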


2 Comments

I'm afraid I'm too ignorant to understand how this helps me. I've adjusted the script to comment out only for page in range(0, 1): and fixed the subsequent indentation. The script runs fine, but I'm still getting duplicated results, and only for the second URL; no results are printed for the first URL.
@Micadeli: See the update above.
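To get one row per URL, every find_element call has to run while that URL's page is still loaded, i.e. inside the for url in urls: loop. One way to structure that is sketched below — not tested against the live PageSpeed page; scrape_rows, wait_for, and the fields parameter are names made up for this sketch, and driver.find_element(by, value) assumes Selenium 4 syntax, which replaced the find_element_by_* helpers:

```python
def scrape_rows(driver, urls, wait_for, fields):
    """Visit each URL and read every field while that page is loaded.

    driver   : any object with .get(url) and .find_element(by, value)
    wait_for : callable(driver) that blocks until the page has rendered
    fields   : list of (column_name, by, value) locator triples
    """
    rows = []
    for url in urls:
        driver.get(url)   # navigate first...
        wait_for(driver)  # ...wait for the metrics to render...
        # ...then read every field for THIS page before the next get()
        rows.append({name: driver.find_element(by, value).text
                     for name, by, value in fields})
    return rows
```

With real Selenium this would be called roughly as scrape_rows(webdriver.Chrome(), urls, lambda d: WebDriverWait(d, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'origin-field-data'))), fields) with fields listing the XPaths from the question, and the returned list of dicts feeds straight into pd.DataFrame(rows).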
