I want to download a webpage using Python for a web scraping task. The problem is that the website requires cookies to be enabled; otherwise it serves a different version of the page. I did implement a solution that solves the problem, but in my opinion it is inefficient. I need your help to improve it!

This is how I do it now:

import requests
import cookielib

cj = cookielib.CookieJar()
user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'

# First request, made only to get the cookies
requests.get(url, headers=user_agent, timeout=2, cookies=cj)
# Second request, reusing the cookies served the first time
r = requests.get(url, headers=user_agent, timeout=2, cookies=cj)
html_text = r.text

Basically, I create a CookieJar object and then send two consecutive requests for the same URL. The first time it serves me the bad page, but as compensation it gives me cookies. The second request reuses these cookies and I get the right page.

The question is: is it possible to use just one request and still get the right, cookie-enabled version of the page?

I tried sending a HEAD request the first time instead of GET to minimize traffic, but in that case no cookies are served. Googling for it didn't give me the answer either. So it would be interesting to understand how to do this efficiently! Any ideas?!

  • You don't need to manually work with CookieJar starting from requests 0.6.0 (see the sketch after these comments): kennethreitz.com/requests-v060-released.html#dict-cookies Commented Nov 19, 2012 at 2:06
  • Yeah @yonilevy good catch! Will use it that way now. Commented Nov 19, 2012 at 2:25
  • link is broken @yonilevy Commented Jul 30, 2013 at 14:26
  • @goldisfine thanks, here's another one: stackoverflow.com/a/7164897/145823 Commented Jul 30, 2013 at 15:53
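
To illustrate the dict-style cookie handling the first comment refers to, here is a minimal sketch that skips CookieJar entirely; the cookies attached to the first response can be passed straight back into the second request:

import requests

url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'
user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

# First request: let requests collect whatever cookies the server sets
r1 = requests.get(url, headers=user_agent, timeout=2)
# The cookies of a response behave like a dict and can be passed straight back in
r2 = requests.get(url, headers=user_agent, timeout=2, cookies=r1.cookies)
html_text = r2.text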

2 Answers


You need to make a request to get the cookie, so no, you cannot obtain the cookie and reuse it without making two separate requests. If by "cookie-enabled" you mean the version served once the site recognizes your script as accepting cookies, then it all depends on the server, and you could try:

  • hardcoding the cookies before making the first request (see the sketch after this list),
  • requesting the smallest possible page (one with the smallest possible response that still sets the cookies) to obtain the first cookie,
  • trying to find some workaround (maybe adding some GET argument will fool the site into believing you have cookies, but you would need to find it for this specific site).
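
To illustrate the first option, here is a minimal sketch of hardcoding cookies; the cookie name and value below are hypothetical and would have to be captured for this specific site, for example from a browser's developer tools:

import requests

url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'

# Hypothetical cookie: the real name and value must be captured from an
# actual browser session against this site (e.g. via the developer tools)
hardcoded_cookies = {'SESSION_COOKIE_NAME': 'value-captured-from-a-browser'}

r = requests.get(url, cookies=hardcoded_cookies, timeout=2)
html_text = r.text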

4 Comments

Thanks @Tadeck! I actually don't know the pages in advance and cannot predict what the behavior on their side will be (with or without cookies). So, in this case, taking your comment into account, I think two requests are required. BTW, by cookie-enabled I mean that in order to serve the right page their server asks for cookies. When I load the page listed in the example in a browser, it seems that the server exchanges several messages with me before I see the right page.
Also, maybe there is a way to at least not make these two sequential requests for all pages in my DB? Say some pages serve the right page from the beginning, but sometimes I encounter this problem. Is there a way to judge from the first request whether the page is a surrogate or not? I guess not, but what do you think?!
@Nick: It looks like they do not want the page to be scraped, and thus do not make it easily identifiable. I think there is no universal way of identifying such cases across several different sites. In this specific case you can try to identify differences: e.g. the first response has a "respondwithsignonpage" header set to "true", which you could use for checks (see the sketch after these comments). However, this is a non-standard HTTP header and you will most likely not find it on other sites.
Thank you, @Tadeck! I agree with you. I am already comparing the differences between the files served, just for fun, to see what the percentage of such cases is. I don't think they are abundant.
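
Building on the comment above, here is a minimal sketch of such a check; "respondwithsignonpage" is the non-standard header reportedly set by this particular site, so the test will not carry over to other sites:

import requests

url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'

r = requests.get(url, timeout=2)

# Non-standard header this site sets on its cookie-less placeholder page;
# only retry (reusing the first response's cookies) when it is present
if r.headers.get('respondwithsignonpage', '').lower() == 'true':
    r = requests.get(url, cookies=r.cookies, timeout=2)

html_text = r.text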

I think the winner here might be to use requests' session framework, which takes care of the cookies for you.

That would look something like this:

import requests

user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'

# A Session keeps cookies across requests automatically
s = requests.Session()
s.headers.update(user_agent)

r = s.get(url, timeout=2)
html_text = r.text

Try that and see if it works.

2 Comments

No, @jdotjdot, it didn't work. The reason is that the session also needs that first interaction to pick up the cookies, so two requests are still needed in this case (see the sketch after these comments). Thanks for the effort though!
Yeah, I even tried again using s.head(...), and that didn't work either. Kind of an odd issue.
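
For reference, here is a sketch of the pattern these comments end up at: keep the two requests, but let a Session carry the cookies implicitly instead of wiring a CookieJar through by hand:

import requests

url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'

s = requests.Session()
s.headers.update({'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})

# The first request primes the session's cookie jar ...
s.get(url, timeout=2)
# ... and the second one reuses those cookies automatically
r = s.get(url, timeout=2)
html_text = r.text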
