
I am trying to scrape data from the Sunshine List website (http://www.sunshinelist.ca/) using the BeautifulSoup library and the Selenium package (in order to deal with the 'Next' button on the webpage). I know there are several related posts but I just can't identify where and how I should explicitly ask the driver to wait.

Error: StaleElementReferenceException: Message: The element reference is stale: either the element is no longer attached to the DOM or the page has been refreshed

This is the code I have written:

import numpy as np
import pandas as pd
import requests
import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

ffx_bin = FirefoxBinary(r'C:\Users\BhagatM\AppData\Local\Mozilla Firefox\firefox.exe')
ffx_caps = DesiredCapabilities.FIREFOX
ffx_caps['marionette'] = True
driver = webdriver.Firefox(capabilities=ffx_caps,firefox_binary=ffx_bin)
driver.get("http://www.sunshinelist.ca/")
driver.maximize_window()

tablewotags1=[]

while True:
    divs = driver.find_element_by_id('datatable-disclosures')
    divs1=divs.find_elements_by_tag_name('tbody')

    for d1 in divs1:
        div2=d1.find_elements_by_tag_name('tr')
        for d2 in div2:
            tablewotags1.append(d2.text)

    try:
        driver.find_element_by_link_text('Next →').click()
    except NoSuchElementException:
        break

year1=tablewotags1[0::10]
name1=tablewotags1[3::10]
position1=tablewotags1[4::10]
employer1=tablewotags1[1::10]  


df1=pd.DataFrame({'Year':year1,'Name':name1,'Position':position1,'Employer':employer1})
df1.to_csv('Sunshine List-1.csv', index=False)
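For context, the stride slicing above assumes the flat tablewotags1 list carries ten text entries per record, with year at offset 0, employer at offset 1, name at offset 3, and position at offset 4. A minimal self-contained illustration of that partitioning, using made-up data (the real page may lay its rows out differently):

```python
# Hypothetical flat list: two records, ten fields each, mirroring the
# [0::10], [1::10], [3::10], [4::10] offsets used above. The 'x'/'a'..'e'
# entries stand in for the columns the script does not keep.
flat = [
    '2016', 'City of Toronto', 'x', 'Jane Doe', 'Manager', 'a', 'b', 'c', 'd', 'e',
    '2016', 'Hydro One',       'x', 'John Roe', 'Analyst', 'a', 'b', 'c', 'd', 'e',
]

years = flat[0::10]      # every 10th item starting at index 0
employers = flat[1::10]  # every 10th item starting at index 1
names = flat[3::10]
positions = flat[4::10]

print(years)   # ['2016', '2016']
print(names)   # ['Jane Doe', 'John Roe']
```

If the table ever changes its column count, these offsets silently go wrong, which is one argument for collecting each row as a tuple instead of a flat list.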

1 Answer


I think you just need to point to the correct Firefox binary. Also, which version of Firefox are you using? It looks like one of the newer versions; if that's the case, this should do:

ffx_bin = FirefoxBinary(r'pathtoyourfirefox')
ffx_caps = DesiredCapabilities.FIREFOX
ffx_caps['marionette'] = True
driver = webdriver.Firefox(capabilities=ffx_caps,firefox_binary=ffx_bin)

Cheers

EDIT: So, to answer your new enquiry, "why is it not writing the CSV": you should do it like this:

import csv   # You are missing this import
ls_general_list = []

def csv_for_me(list_to_csv):
    with open(pathtocsv, 'a', newline='') as csvfile:
        sw = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for line in list_to_csv:
            sw.writerow(line)  # each line is one row (a tuple of fields)

Then replace this in your code, df=pd.DataFrame({'Year':year,'Name':name,'Position':position,'Employer':employer}),

with this one: ls_general_list.append((year, name, position, employer)),

then call csv_for_me(ls_general_list).
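A quick usage sketch of the helper above; pathtocsv is whatever output path you choose (the name here is made up), and the delimiter/quotechar settings are just the answer's example values:

```python
import csv

pathtocsv = 'sunshine_sample.csv'  # hypothetical output path

def csv_for_me(list_to_csv):
    # Append each tuple in the list as one CSV row.
    with open(pathtocsv, 'a', newline='') as csvfile:
        sw = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for line in list_to_csv:
            sw.writerow(line)

# Two dummy records in the (year, name, position, employer) shape
ls_general_list = [
    ('2016', 'Jane Doe', 'Manager', 'City of Toronto'),
    ('2016', 'John Roe', 'Analyst', 'Hydro One'),
]
csv_for_me(ls_general_list)
# The file now holds the two rows, one comma-separated line each.
```

Note the file is opened in append mode ('a'), so calling the function twice will duplicate rows rather than overwrite them.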

Please accept the answer if it's satisfactory; now you have a CSV.


7 Comments

Your input was very helpful. I have modified my question to a new issue I am facing.
The code to create the .csv file seems to be fine. The issue is that the code is scraping around 120k rows of data and I think Python is not able to handle that. The code worked fine when I tried scraping the first 1000 rows of data (the .csv file was created). Any idea how I could split the data into separate .csv files?
Also another thing I noticed was that even though the code clicks on the 'next' button, the list 'tablewotags' simply ends up storing the data on the first page, multiple times. I haven't been able to identify the problem but I have a feeling the code within the while loop is the issue.
Move tablewotags = [] out of the loop, and at the end use the function to write to the CSV. I think you only need some basic libraries to achieve what you want; in other words, Pandas is not necessary.
I have moved tablewotags = [], but I am now trying to resolve the StaleElementReferenceException.
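On the earlier comment about splitting ~120k scraped rows across several files: one simple approach (a sketch using only the stdlib csv module; the chunk size, function name, and file-name prefix are all made up for illustration) is to write the accumulated rows in fixed-size chunks, one numbered file per chunk:

```python
import csv

def write_in_chunks(rows, chunk_size=1000, prefix='sunshine_part'):
    """Write rows to numbered CSV files, at most chunk_size rows apiece.

    Returns the list of file paths written.
    """
    paths = []
    for i in range(0, len(rows), chunk_size):
        path = f'{prefix}_{i // chunk_size + 1}.csv'
        with open(path, 'w', newline='') as f:
            csv.writer(f).writerows(rows[i:i + chunk_size])
        paths.append(path)
    return paths

# e.g. 2500 dummy rows -> 3 files (1000 + 1000 + 500 rows)
demo_rows = [('2016', f'Person {n}', 'Role', 'Employer') for n in range(2500)]
print(write_in_chunks(demo_rows))
```

Writing each chunk as it is scraped (rather than holding all 120k rows in one list) would also keep memory use flat.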
