
I'm trying to scrape pages from sites that require authentication. I was able to take the JSESSIONID cookie from an authenticated browser session and download the correct page with a urllib2 opener, as below.

import cookielib, urllib2

# SESSIONID and DOMAIN are the session value and site domain copied from the browser
cj = cookielib.CookieJar()
c1 = cookielib.Cookie(None, "JSESSIONID", SESSIONID, None, None, DOMAIN,
        True, False, "/store", True, False, None, False, None, None, None)
cj.set_cookie(c1)

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
fh = opener.open(url)

But when I use this code to create Scrapy requests (I tried both dict cookies and a cookiejar), the downloaded page is the non-authenticated version. Does anyone know what the problem is?

from scrapy import Request

cookies = [{
    'name': 'JSESSIONID',
    'value': SESSIONID,
    'path': '/store',
    'domain': DOMAIN,
    'secure': False,
}]

# attempt 1: dict-style cookies
request1 = Request(url, cookies=cookies, meta={'dont_merge_cookies': False})
# attempt 2: the cookiejar built above, passed through meta
request2 = Request(url, meta={'dont_merge_cookies': True, 'cookiejar': cj})
  • Did you try just cookies={'JSESSIONID': SESSIONID}? Commented Nov 30, 2013 at 5:49

1 Answer


You were able to get the JSESSIONID from your browser.

Why not let Scrapy simulate a user login for you?
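For instance, a minimal sketch of what that could look like; the login URL, form field names, failure marker and follow-up page below are assumptions for illustration, not taken from your site:

import scrapy


class StoreSpider(scrapy.Spider):
    name = "store"
    # hypothetical login page -- replace with the real URL and credentials
    start_urls = ["https://example.com/store/login"]

    def parse(self, response):
        # fill in and submit the login form found on the page
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "john", "password": "secret"},  # assumed field names
            callback=self.after_login,
        )

    def after_login(self, response):
        if b"authentication failed" in response.body.lower():  # assumed failure marker
            self.logger.error("Login failed")
            return
        # the session cookie set by the login response is now in Scrapy's
        # default cookie jar and is sent automatically with later requests
        yield scrapy.Request(
            "https://example.com/store/account",  # assumed authenticated page
            callback=self.parse_account,
        )

    def parse_account(self, response):
        # scrape the authenticated page here
        pass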

Then, I think your JSESSIONID cookie will stick to subsequent requests, given that:

  • Scrapy uses a single cookie jar (as opposed to Multiple cookie sessions per spider) for the entire spider lifetime, covering all your scraping steps,
  • the COOKIES_ENABLED setting for the cookie middleware defaults to True,
  • dont_merge_cookies defaults to False:

    When some site returns cookies (in a response) those are stored in the cookies for that domain and will be sent again in future requests. That’s the typical behaviour of any regular web browser. However, if, for some reason, you want to avoid merging with existing cookies you can instruct Scrapy to do so by setting the dont_merge_cookies key to True in the Request.meta.

    Example of request without merging cookies:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies={'currency': 'USD', 'country': 'UY'},
                                   meta={'dont_merge_cookies': True})
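
If, on the other hand, you want to keep reusing the JSESSIONID you already took from the browser (as the comment on the question suggests), a minimal sketch could be to pass it as a plain name/value dict and leave dont_merge_cookies at its default, so the middleware keeps it in the spider's single jar; the URLs and SESSIONID below are placeholders:

import scrapy

SESSIONID = "paste-the-value-from-your-browser-here"  # placeholder


class SeededSessionSpider(scrapy.Spider):
    name = "seeded_session"

    def start_requests(self):
        # seed the cookies middleware with the existing session cookie;
        # it is merged into the spider's single default cookie jar
        yield scrapy.Request(
            "https://example.com/store/account",  # assumed authenticated page
            cookies={"JSESSIONID": SESSIONID},
            callback=self.parse,
        )

    def parse(self, response):
        # later requests reuse the same jar, so the session cookie
        # (and anything the server sets afterwards) goes along automatically
        yield scrapy.Request(
            response.urljoin("/store/orders"),  # assumed follow-up page
            callback=self.parse_orders,
        )

    def parse_orders(self, response):
        pass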
    

1 Comment

I always see every login example end here: `# continue scraping with authenticated session...`, yet that's exactly the step most people have trouble with. I'm trying to use Scrapy, and the login is successful, yet my next request is still unauthenticated and fails with a 403 error.
