
I'm trying to scrape the links from the careers page on a college website, and I am getting this error.

urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Moved Temporarily

I think this is because the site has a session cookie. After doing a bit of reading, there seem to be many ways to get around this (Requests, http.cookiejar, Selenium/PhantomJS), but I don't know how to incorporate these solutions into my scraping program.
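From what I've read, the http.cookiejar route would look roughly like this (an untested sketch on my part; the commented-out request at the end is the page from my script):

```python
import urllib.request
from http.cookiejar import CookieJar

def make_cookie_opener():
    """Build an opener that keeps cookies between requests, so a
    session cookie set during the redirect gets sent back on the retry."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

opener, jar = make_cookie_opener()
# html = opener.open("https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp").read()
```

As I understand it, plain urlopen() drops cookies between the redirect and the retry, which is what causes the loop.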

This is my scraping program. It's written in Python 3.6 with BeautifulSoup4.

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp")
soup = BeautifulSoup(html, 'html.parser')
data = soup.select(".ft0 a")
ads = []

for i in data:
    link = i.get('href')
    ads.append(link)

for job in ads:
    print(job)
    print('')

When I clear the cookies in my browser and manually go to the page I'm trying to scrape (https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp), I'm taken to a different page. Once I have the cookie though, I can go directly to the SearchResults page that I want to scrape.

This is the cookie:

[screenshot of the session cookie]

Any thoughts on how I can deal with this cookie?

  • It might also require JavaScript, in which case you would need to load the page via something like Selenium. Commented Mar 21, 2017 at 17:10

2 Answers


Using the requests module, whose Session object keeps cookies across requests and follows the redirect for you:

from bs4 import BeautifulSoup
import requests

session = requests.Session()
req = session.get("https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp")
req.raise_for_status()  # omit this if you don't want an exception on a non-200 response
html = req.text
soup = BeautifulSoup(html, 'html.parser')
data = soup.select(".ft0 a")
ads = []

for i in data:
    link = i.get('href')
    ads.append(link)

for job in ads:
    print(job)
    print('')

However, I am not getting any output, which is probably because data (and therefore ads) ends up empty. I hope this helps you.
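For what it's worth, the parsing side of your script works on its own; here's a minimal offline check against a stub snippet (the markup below is made up to mirror the .ft0 structure implied by your selector, so treat it as illustrative):

```python
from bs4 import BeautifulSoup

# Stand-in for the real page markup; the .ft0 class is taken from the
# selector in the question, everything else here is invented.
html = """
<table>
  <tr><td class="ft0"><a href="/applicants/Central?rowId=1">Job A</a></td></tr>
  <tr><td class="ft0"><a href="/applicants/Central?rowId=2">Job B</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
ads = [a.get("href") for a in soup.select(".ft0 a")]
print(ads)  # ['/applicants/Central?rowId=1', '/applicants/Central?rowId=2']
```

If this prints links but the live page yields none, the rows are most likely rendered by JavaScript, which neither urllib nor requests will execute.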




The website you are trying to access is probably testing for both cookies and JavaScript to be present. Python does provide the http.cookiejar library, but that will not be enough if JavaScript is also mandatory.

Instead you could use Selenium to get the HTML. It is a bit like a remote control for an existing browser, and can be used as follows:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

url = "https://jobs.fanshawec.ca/applicants/Central?delegateParameter=searchDelegate&actionParameter=showSearch&searchType=8192"

browser = webdriver.Firefox(firefox_binary=FirefoxBinary())  # needs geckodriver on your PATH
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')

data = soup.select(".ft0 a")
ads = []

for i in data:
    link = i.get('href')
    ads.append(link)

for job in ads:
    print(job)

(Also look at PhantomJS for a headless solution)

Which would give you your links starting as follows:

/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174604&c=%2BWIX1RV817HeJUg7cnxxnQ%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174585&c=4E7TSRVJx7jLG39iR7HvMw%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174563&c=EyCIe7a8xt0a%2BLp4xqtzaw%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174566&c=coZCMU3091mmz%2BE7p%2BHNIg%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants
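Note that these hrefs are site-relative, so before requesting any of them you would want to join them with the site's base URL; urllib.parse.urljoin handles that (the rowId below is one from the output above):

```python
from urllib.parse import urljoin

base = "https://jobs.fanshawec.ca"
rel = "/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174604"

# urljoin resolves a root-relative path against the scheme and host of base
full = urljoin(base, rel)
print(full)
```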

Note: To use Selenium, you will need to install it, as it is not part of the default Python libraries:

pip install selenium

6 Comments

I ran your code, and I got these errors: FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver', followed by selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
I downloaded geckodriver, and I'm still getting the error.
Mine happens to be in my Python folder C:\Python27\geckodriver.exe
Do I have to put the path in the code? I tried adding this to no success: binary = FirefoxBinary('I put the path in here') browser = webdriver.Firefox(firefox_binary=binary)
I'll spend some more time with the Selenium docs. Thanks for the help.
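For anyone hitting the same error: FirefoxBinary points at Firefox itself, whereas geckodriver is a separate executable that Selenium has to be able to find on PATH. One workaround is to extend PATH from within the script before creating the webdriver; a sketch, using the folder mentioned in the comments above (illustrative — adjust to wherever your geckodriver actually lives):

```python
import os

# Prepend the folder containing geckodriver.exe to this process's PATH
# so Selenium can find it (C:\Python27 is the location from the comment
# above; replace it with your own geckodriver folder).
driver_dir = r"C:\Python27"
os.environ["PATH"] = driver_dir + os.pathsep + os.environ.get("PATH", "")
```

After this runs, webdriver.Firefox() should be able to locate the driver without any extra arguments.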
