I called the following code to visit a url and tried to print the content on that page:

import urllib2
f = urllib2.urlopen("https://www.reaxys.com/reaxys/secured/customset.do?performed=true&action=get_preparations&searchParam=1287039&workflowId=1338317532514&workflowStep=1&clientDateTime=2012-05-29%2015:17")
page = f.read()
print page
f.close()

I'm not sure if the url is accessible everywhere, so the content on that page might not be accessible to everyone.

This page sets a time limit on how long a user can stay on it; after that time, a popup appears saying the user has reached the timeout.

Here's the problem I ran into: when I typed the URL into a browser, everything opened just fine. But when I printed what Python read from that URL, I got the page that is only supposed to appear once the timeout has been reached.

I don't know what's wrong. Is it Python or the website? How can I make Python read the actual content of that page?

Thanks in advance.

1 Answer

It appears to be related to cookies being set by the website. If I visit the URL

https://www.reaxys.com/reaxys/secured/customset.do?performed=true&action=get_preparations&searchParam=1287039&workflowId=1338317532514&workflowStep=1

in my browser, I get the same timeout error. If I refresh, the site loads fine. But if I clear my cookies for the site and retry, I get the timeout again. So I suspect that the site is executing some process that adds a timestamp and checks it before the page is visible, and that it defaults to the timeout page if for some reason the cookie can't be set (as would be the case with a plain urllib2.urlopen call from a Python script, which does not store cookies).

I would suggest doing an in-depth investigation of the cookies being set (start with the JavaScript on that page, which seems to be handling some of the timeout logic), and then try setting cookies from the scraping process as per: http://www.testingreflections.com/node/view/5919 , http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/ , or the like.
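As a starting point, here is a minimal sketch of cookie handling with the standard library alone. It assumes the cookie is set through an ordinary Set-Cookie response header; if the site sets it from JavaScript, or requires a login, this will not be enough. The function names build_cookie_opener and fetch_twice and the User-Agent value are made up for illustration, and the try/except import is only there so the same code runs under Python 2 (as in the question) and Python 3.

```python
try:  # Python 2 module names, as used in the question
    import cookielib
    import urllib2 as request
except ImportError:  # Python 3 names for the same modules
    import http.cookiejar as cookielib
    import urllib.request as request


def build_cookie_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = cookielib.CookieJar()
    opener = request.build_opener(request.HTTPCookieProcessor(jar))
    # Assumption: a browser-like agent string; the value is only illustrative.
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener, jar


def fetch_twice(opener, url):
    """Request the URL twice so that cookies set by the first
    response are sent back with the second request."""
    opener.open(url).close()
    response = opener.open(url)
    try:
        return response.read()
    finally:
        response.close()
```

To try it, call opener, jar = build_cookie_opener(), then page = fetch_twice(opener, url) with the URL from the question. Iterating over jar afterwards (each entry has .name and .value) shows which cookies the server actually set, which is a quick way to check whether the cookie theory holds.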

(This is in no way intended to condone the scraping of an Elsevier site, as they may come after you and eat your young :) )
