Fetch a page with cookies using Python requests library

Question

I'm just learning the requests library (http://docs.python-requests.org/en/latest/) and have a problem fetching a page with cookies using requests.

For example:

import requests

url2 = 'https://passport.baidu.com'
parsedCookies = {'PTOKEN': '412f...', 'BDUSS': 'hnN2...', ...}  # cookie values replaced by ... for privacy
req = requests.get(url2, cookies=parsedCookies)
text = req.text.encode('utf-8', 'ignore')
f = open('before.html', 'w')
f.write(text)
f.close()
req.close()

When I use the code above to fetch the page, it saves the login page to 'before.html' instead of the logged-in page, which means I have not actually logged in successfully.

But if I use urllib2 to fetch the page, it works as expected.

import urllib2

parsedCookies = "PTOKEN=412f...;BDUSS=hnN2...;..."  # different format, same content as the cookies above
req = urllib2.Request(url2)
req.add_header('Cookie', parsedCookies)
ret = urllib2.urlopen(req)
f = open('before_urllib2.html', 'w')
f.write(ret.read())
f.close()
ret.close()

With this code, the logged-in page is saved in before_urllib2.html.
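For readers on Python 3, where urllib2 became urllib.request, here is a rough sketch of the same idea. It also shows how the two cookie formats above relate: the dict form can be joined into the header string form. The cookie values are hypothetical placeholders (the real ones are elided above), and cookie_header is a helper name made up for this illustration.

```python
import urllib.request

# Hypothetical placeholder values; the real cookie values are elided in the question.
cookies = {'PTOKEN': 'abc', 'BDUSS': 'xyz'}


def cookie_header(cookies):
    """Join a name -> value dict into a 'name=value; name=value' Cookie header string."""
    return '; '.join('%s=%s' % (name, value) for name, value in cookies.items())


req = urllib.request.Request('https://passport.baidu.com')
req.add_header('Cookie', cookie_header(cookies))

# urllib.request.urlopen(req) would actually send the request;
# it is left out here so the sketch runs without network access.
header_value = req.get_header('Cookie')
```

Here header_value holds the string that would go out in the Cookie header, in the same shape as the urllib2 version.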

--

Are there any mistakes in my code? Any reply would be appreciated.

Why not use a session here and have requests take care of the cookies for you?
– Martijn Pieters, Commented Sep 29, 2013 at 7:25
You can also use req.content to get the original encoded bytes. And encoding Unicode to UTF-8 never needs 'ignore'; UTF-8 can handle all code points.
– Martijn Pieters, Commented Sep 29, 2013 at 7:28
Thanks for your reply. First, in my code the cookies are passed in from outside, so I can only do it this way. Second, req.content works well, thanks for the reminder. I used to call encode() without 'ignore', but it raised an exception like "UnicodeEncodeError: 'gbk' codec can't encode character u'\uXXXX' in position XX", so I added 'ignore'. Do you know why?
– Memory, Commented Sep 29, 2013 at 9:08
The latest one downloaded from GitHub. The filename is requests-master.
– Memory, Commented Sep 30, 2013 at 3:06
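
To illustrate the encoding point from the comments, here is a small Python 3 sketch using made-up sample text rather than the actual page. It shows that encoding to UTF-8 succeeds without 'ignore', while a legacy codec such as GBK can fail on the same text. The 'gbk' in the error message suggests that some step other than the explicit UTF-8 encode was using the platform's default codec, but that is an assumption; the thread does not confirm it. Writing bytes in binary mode avoids any implicit encoding.

```python
import os
import tempfile

# Made-up sample text (accented letter, Chinese characters, an emoji); not from the real page.
text = 'caf\u00e9 \u4e2d\u6587 \U0001F600'

# UTF-8 can represent every character here, so no error handler is needed.
utf8_bytes = text.encode('utf-8')

# GBK has no representation for the emoji, so a strict encode fails.
try:
    text.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# Writing bytes in binary mode means no codec is applied behind the scenes.
path = os.path.join(tempfile.mkdtemp(), 'before.html')
with open(path, 'wb') as f:
    f.write(utf8_bytes)

with open(path, 'rb') as f:
    round_trip = f.read().decode('utf-8')
```

With requests, the bytes to write would be req.content instead of utf8_bytes.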

zhujs · Accepted Answer · 2013-09-29 15:43:27Z

2

You can use a Session object to get what you want:

import requests

url2 = 'http://passport.baidu.com'
session = requests.Session()  # create a Session object
cookie = requests.utils.cookiejar_from_dict(parsedCookies)  # parsedCookies is the dict from the question
session.cookies.update(cookie)  # set the cookies of the Session object

headers = {}  # optional extra request headers; an empty dict adds none
req = session.get(url2, headers=headers, allow_redirects=True)

If you use the requests.get function, the cookies you pass in are not sent to the redirected page. If you use Session().get instead, the session keeps the cookies and sends them with every HTTP request it makes, which is exactly what the concept of a "session" means.
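
One way to check what a session would send, without touching the network, is to prepare a request and look at its Cookie header. This is only a sketch: the cookie values are hypothetical placeholders (the real ones are elided in the question), and cookies_sent is a helper name invented for this example.

```python
import requests

# Hypothetical placeholder cookies; the real values are elided in the question.
parsed_cookies = {'PTOKEN': 'abc', 'BDUSS': 'xyz'}

session = requests.Session()
session.cookies.update(requests.utils.cookiejar_from_dict(parsed_cookies))


def cookies_sent(session, url):
    """Return the set of 'name=value' pairs the session would put in the Cookie header."""
    # prepare_request merges the session's cookie jar into the outgoing request
    # without actually sending anything.
    prepared = session.prepare_request(requests.Request('GET', url))
    header = prepared.headers.get('Cookie', '')
    return set(part for part in header.split('; ') if part)


first = cookies_sent(session, 'http://passport.baidu.com/center')
second = cookies_sent(session, 'http://passport.baidu.com/center?_t=1380462657')
```

Both first and second contain the same cookies, because they come from the session's jar rather than from an argument to a single call.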

Let me elaborate on what happens here:

When I sent the cookies to http://passport.baidu.com/center with allow_redirects set to False, the returned status code was 302 and one of the response headers was 'location': '/center?_t=1380462657' (this value is generated dynamically by the server, so replace it with whatever you get back):

url2 = 'http://passport.baidu.com/center'
req = requests.get(url2, cookies=parsedCookies, allow_redirects=False)
print(req.status_code)  # output: 302
print(req.headers)  # includes the 'location' header

But when I set allow_redirects to True, the request still does not end up on the redirected page (http://passport.baidu.com/center?_t=1380462657); the server returns the login page instead. The reason is that requests.get does not send the cookies to the redirected URL, here http://passport.baidu.com/center?_t=1380462657, so the login fails. That is why we need the Session object.
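
The redirect flow can be tried locally against a tiny stand-in server. This is a sketch under stated assumptions: CenterHandler, its cookie check, and the '/center?_t=1' location are all invented to imitate the behaviour described above, not taken from the real site. It only demonstrates the Session path; it does not try to reproduce the requests.get behaviour, which may differ between requests versions.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class CenterHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the account page: needs a cookie and redirects once."""

    def do_GET(self):
        logged_in = 'PTOKEN=abc' in self.headers.get('Cookie', '')
        if not logged_in:
            self._reply(200, b'login page')
        elif self.path == '/center':
            # First hit: redirect to a URL carrying a token, like '/center?_t=...'
            self.send_response(302)
            self.send_header('Location', '/center?_t=1')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            self._reply(200, b'account page')

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


server = HTTPServer(('127.0.0.1', 0), CenterHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_address[1]

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment for this local demo
session.cookies.update({'PTOKEN': 'abc'})

first = session.get(base + '/center', allow_redirects=False)  # stops at the 302
final = session.get(base + '/center', allow_redirects=True)   # follows the redirect

server.shutdown()
server.server_close()
```

Here first is the 302 response with its Location header, and final.history lists the redirect the session followed before reaching the account page.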

If I set url2 = 'http://passport.baidu.com/center?_t=1380462657' directly, it returns the page you want. So one solution is to use the code above to get the dynamic location value, build the URL of your account page from it (like http://passport.baidu.com/center?_t=1380462657), and then request that page:

url2 = 'http://passport.baidu.com' + req.headers.get('location')
req = session.get(url2, cookies=parsedCookies, allow_redirects=True)
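
The string concatenation above works when the location header is a path on the same host. As a slightly more general sketch (Python 3, standard library only), urljoin resolves the header against the URL that was requested and also copes with an absolute location. The location value is the one quoted in this answer; the server generates a new one each time.

```python
from urllib.parse import urljoin

request_url = 'http://passport.baidu.com/center'
location = '/center?_t=1380462657'  # value quoted in the answer; dynamic in practice

# Resolve the (possibly relative) Location header against the requested URL.
redirect_url = urljoin(request_url, location)

# If the header already holds a full URL, urljoin returns it unchanged.
absolute_url = urljoin(request_url, 'http://passport.baidu.com/center?_t=1380462657')
```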

But handling the redirect by hand like this is cumbersome, so when dealing with cookies, the Session object does an excellent job for us!

edited Sep 29, 2013 at 15:43

answered Sep 29, 2013 at 8:54

8 Comments

Memory Over a year ago

'passport.baidu.com/v2/api/?login' doesn't work, nor 'passport.baidu.com/v2/?login'.

Memory Over a year ago

Actually, urllib2 works well with 'passport.baidu.com' or 'passport.baidu.com/v2/?login'

Memory Over a year ago

No. It still saves the login page.

Memory Over a year ago

Thanks very much for your reply, I learned a lot. I used a session but removed headers from session.get(): req = session.get(url2, allow_redirects=True). But it still doesn't work properly. Do I need to set any headers?

zhujs Over a year ago

Headers are optional; I only used them for a test originally. My guess is that you don't have enough cookies. In my case, the full set of cookies is: 'USERNAMETYPE', 'STOKEN', 'SAVEUSERID', 'PTOKEN', 'mkey', 'HOSUPPORT', 'Hm_lvt_90056b3f84f90da57dc0f40150f005d5', 'Hm_lpvt_90056b3f84f90da57dc0f40150f005d5', 'HISTORY_E', 'BDUSS', 'BAIDUID'.
