
I'm trying to scrape the links from the careers page on a college website, and I am getting this error.

urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Moved Temporarily

I think this is because the site has a session cookie. After doing a bit of reading, there seem to be many ways to get around this (Requests, http.cookiejar, Selenium/PhantomJS), but I don't know how to incorporate these solutions into my scraping program.
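From what I've read, the http.cookiejar route would look roughly like this (an untested sketch on my part; the commented-out request at the end is the page from my script):

```python
import urllib.request
from http.cookiejar import CookieJar

def make_cookie_opener():
    """Build an opener that keeps cookies between requests, so a
    session cookie set during the redirect gets sent back on the retry."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

opener, jar = make_cookie_opener()
# html = opener.open("https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp").read()
```

As I understand it, plain urlopen() drops cookies between the redirect and the retry, which is what causes the loop.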

This is my scraping program. It's written in Python 3.6 with BeautifulSoup4.

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp")
soup = BeautifulSoup(html, 'html.parser')
data = soup.select(".ft0 a")
ads = []

for i in data:
    link = i.get('href')
    ads.append(link)

for job in ads:
    print(job)
    print('')

When I clear the cookies in my browser and manually go to the page I'm trying to scrape (https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp), I'm taken to a different page. Once I have the cookie though, I can go directly to the SearchResults page that I want to scrape.

This is the cookie:

[screenshot of the session cookie]

Any thoughts on how I can deal with this cookie?

  • It might also require JavaScript, in which case you would need to load the page via something like Selenium. Commented Mar 21, 2017 at 17:10

2 Answers


Using the requests module, whose Session object keeps cookies across requests and follows the redirect for you:

from bs4 import BeautifulSoup
import requests

session = requests.Session()
req = session.get("https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp")
req.raise_for_status()  # omit this if you don't want an exception on a non-200 response
html = req.text
soup = BeautifulSoup(html, 'html.parser')
data = soup.select(".ft0 a")
ads = []

for i in data:
    link = i.get('href')
    ads.append(link)

for job in ads:
    print(job)
    print('')

However, I am not getting any output, which is probably because data (and therefore ads) ends up empty. I hope this helps you.
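For what it's worth, the parsing side of your script works on its own; here's a minimal offline check against a stub snippet (the markup below is made up to mirror the .ft0 structure implied by your selector, so treat it as illustrative):

```python
from bs4 import BeautifulSoup

# Stand-in for the real page markup; the .ft0 class is taken from the
# selector in the question, everything else here is invented.
html = """
<table>
  <tr><td class="ft0"><a href="/applicants/Central?rowId=1">Job A</a></td></tr>
  <tr><td class="ft0"><a href="/applicants/Central?rowId=2">Job B</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
ads = [a.get("href") for a in soup.select(".ft0 a")]
print(ads)  # ['/applicants/Central?rowId=1', '/applicants/Central?rowId=2']
```

If this prints links but the live page yields none, the rows are most likely rendered by JavaScript, which neither urllib nor requests will execute.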




The website you are trying to access is probably testing for both cookies and JavaScript to be present. Python does provide the http.cookiejar library, but that will not be enough if JavaScript is also mandatory.

Instead you could use Selenium to get the HTML. It is a bit like a remote control for an existing browser, and can be used as follows:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

url = "https://jobs.fanshawec.ca/applicants/Central?delegateParameter=searchDelegate&actionParameter=showSearch&searchType=8192"

browser = webdriver.Firefox(firefox_binary=FirefoxBinary())  # needs geckodriver on your PATH
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')

data = soup.select(".ft0 a")
ads = []

for i in data:
    link = i.get('href')
    ads.append(link)

for job in ads:
    print(job)

(Also look at PhantomJS for a headless solution)

Which would give you your links starting as follows:

/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174604&c=%2BWIX1RV817HeJUg7cnxxnQ%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174585&c=4E7TSRVJx7jLG39iR7HvMw%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174563&c=EyCIe7a8xt0a%2BLp4xqtzaw%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174566&c=coZCMU3091mmz%2BE7p%2BHNIg%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants
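Note that these hrefs are site-relative, so before requesting any of them you would want to join them with the site's base URL; urllib.parse.urljoin handles that (the rowId below is one from the output above):

```python
from urllib.parse import urljoin

base = "https://jobs.fanshawec.ca"
rel = "/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174604"

# urljoin resolves a root-relative path against the scheme and host of base
full = urljoin(base, rel)
print(full)
```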

Note: To use Selenium, you will need to install it, as it is not part of the default Python libraries:

pip install selenium

6 Comments

I ran your code, and I got these errors: FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver', followed by selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
I downloaded geckodriver, and I'm still getting the error.
Mine happens to be in my Python folder C:\Python27\geckodriver.exe
Do I have to put the path in the code? I tried adding this to no success: binary = FirefoxBinary('I put the path in here') browser = webdriver.Firefox(firefox_binary=binary)
I'll spend some more time with the Selenium docs. Thanks for the help.
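For anyone hitting the same error: FirefoxBinary points at Firefox itself, whereas geckodriver is a separate executable that Selenium has to be able to find on PATH. One workaround is to extend PATH from within the script before creating the webdriver; a sketch, using the folder mentioned in the comments above (illustrative — adjust to wherever your geckodriver actually lives):

```python
import os

# Prepend the folder containing geckodriver.exe to this process's PATH
# so Selenium can find it (C:\Python27 is the location from the comment
# above; replace it with your own geckodriver folder).
driver_dir = r"C:\Python27"
os.environ["PATH"] = driver_dir + os.pathsep + os.environ.get("PATH", "")
```

After this runs, webdriver.Firefox() should be able to locate the driver without any extra arguments.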
