I want to download a webpage using Python for a web scraping task. The problem is that the website requires cookies to be enabled; otherwise it serves a different version of the page. I did implement a solution that solves the problem, but in my opinion it is inefficient. I need your help to improve it!

This is how I do it now:

import requests
import cookielib

cj = cookielib.CookieJar()
user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'

# First request, made only to get the cookies
requests.get(url, headers=user_agent, timeout=2, cookies=cj)
# Second request, reusing the cookies served the first time
r = requests.get(url, headers=user_agent, timeout=2, cookies=cj)
html_text = r.text

Basically, I create a CookieJar object and then send two consecutive requests for the same URL. The first time it serves me the bad page, but as compensation it gives me cookies. The second request reuses these cookies and I get the right page.

The question is: is it possible to use just one request and still get the right, cookie-enabled version of the page?

I tried sending a HEAD request the first time instead of GET to minimize traffic, but in that case no cookies are served. Googling for it didn't give me the answer either. So it would be interesting to understand how to do this efficiently! Any ideas?!

  • You don't need to manually work with CookieJar starting from requests 0.6.0 (see the sketch after these comments): kennethreitz.com/requests-v060-released.html#dict-cookies Commented Nov 19, 2012 at 2:06
  • Yeah @yonilevy good catch! Will use it that way now. Commented Nov 19, 2012 at 2:25
  • link is broken @yonilevy Commented Jul 30, 2013 at 14:26
  • @goldisfine thanks, here's another one: stackoverflow.com/a/7164897/145823 Commented Jul 30, 2013 at 15:53
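
To illustrate the dict-style cookie handling the first comment refers to, here is a minimal sketch that skips CookieJar entirely; the cookies attached to the first response can be passed straight back into the second request:

import requests

url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'
user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

# First request: let requests collect whatever cookies the server sets
r1 = requests.get(url, headers=user_agent, timeout=2)
# The cookies of a response behave like a dict and can be passed straight back in
r2 = requests.get(url, headers=user_agent, timeout=2, cookies=r1.cookies)
html_text = r2.text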

2 Answers


You need to make a request to get the cookie, so no, you cannot obtain the cookie and reuse it without making two separate requests. If by "cookie-enabled" you mean the version served once the site recognizes your script as accepting cookies, then it all depends on the server, and you could try:

  • hardcoding the cookies before making the first request (see the sketch after this list),
  • requesting the smallest possible page (one with the smallest possible response that still sets the cookies) to obtain the first cookie,
  • trying to find some workaround (maybe adding some GET argument will fool the site into believing you have cookies, but you would need to find it for this specific site).
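
To illustrate the first option, here is a minimal sketch of hardcoding cookies; the cookie name and value below are hypothetical and would have to be captured for this specific site, for example from a browser's developer tools:

import requests

url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'

# Hypothetical cookie: the real name and value must be captured from an
# actual browser session against this site (e.g. via the developer tools)
hardcoded_cookies = {'SESSION_COOKIE_NAME': 'value-captured-from-a-browser'}

r = requests.get(url, cookies=hardcoded_cookies, timeout=2)
html_text = r.text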

4 Comments

Thanks @Tadeck! I actually don't know the pages in advance and cannot predict what the behavior on their side will be (with or without cookies). So, in this case, taking your comment into account, I think two requests are required. BTW, by cookie-enabled I mean that in order to serve the right page their server asks for cookies. When I load the page listed in the example in a browser, it seems that the server exchanges several messages with me before I see the right page.
Also, maybe there is a way to at least not make these two sequential requests for all pages in my DB? Say some pages serve the right page from the beginning, but sometimes I encounter this problem. Is there a way to judge from the first request whether the page is a surrogate or not? I guess not, but what do you think?!
@Nick: It looks like they do not want the page to be scraped, and thus do not make it easily identifiable. I think there is no universal way of identifying such cases across several different sites. In this specific case you can try to identify differences: e.g. the first response has a "respondwithsignonpage" header set to "true", which you could use for checks (see the sketch after these comments). However, this is a non-standard HTTP header and you will most likely not find it on other sites.
Thank you, @Tadeck! I agree with you. I am already comparing the differences between the files served, just for fun, to see what the percentage of such cases is. I don't think they are abundant.
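
Building on the comment above, here is a minimal sketch of such a check; "respondwithsignonpage" is the non-standard header reportedly set by this particular site, so the test will not carry over to other sites:

import requests

url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'

r = requests.get(url, timeout=2)

# Non-standard header this site sets on its cookie-less placeholder page;
# only retry (reusing the first response's cookies) when it is present
if r.headers.get('respondwithsignonpage', '').lower() == 'true':
    r = requests.get(url, cookies=r.cookies, timeout=2)

html_text = r.text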

I think the winner here might be to use requests' session framework, which takes care of the cookies for you.

That would look something like this:

import requests

user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'

# A Session keeps cookies across requests automatically
s = requests.Session()
s.headers.update(user_agent)

r = s.get(url, timeout=2)
html_text = r.text

Try that and see if it works.

2 Comments

No, @jdotjdot, it didn't work. The reason is that the session also needs that first interaction to pick up the cookies, so two requests are still needed in this case (see the sketch after these comments). Thanks for the effort though!
Yeah, I even tried again using s.head(...), and that didn't work either. Kind of an odd issue.
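
For reference, here is a sketch of the pattern these comments end up at: keep the two requests, but let a Session carry the cookies implicitly instead of wiring a CookieJar through by hand:

import requests

url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'

s = requests.Session()
s.headers.update({'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})

# The first request primes the session's cookie jar ...
s.get(url, timeout=2)
# ... and the second one reuses those cookies automatically
r = s.get(url, timeout=2)
html_text = r.text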
