
I want to scrape the data at this link:

http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json

I am not sure what kind of resource this link returns: HTML, JSON, or something else (sorry, my web knowledge is limited). I tried the following code to fetch it:

import requests

url='http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json'
source=requests.get(url).text

The type of source is unicode. I also tried urllib2:

import urllib2

source2=urllib2.urlopen(url).read()

The type of source2 is str. I am not sure which method is better, because the link does not return a normal web page made of tags. If I want to clean the scraped data and load it into a DataFrame (such as a pandas DataFrame), what process should I follow?

Thanks.

1
  • @depperm, thanks for the reply. I updated the link; it should work now. Commented Nov 10, 2016 at 14:25

2 Answers

0

The response is text with valid JSON data inside it. You can validate it yourself with a service like http://jsonlint.com/ if you want. To do so, copy just the content between the parentheses:

return_json("JSON code to copy")

To use that data, you just need to parse it in your program. The json module documentation has examples: https://docs.python.org/2/library/json.html
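As a minimal sketch of that extraction step, the snippet below pulls the payload out of a callback wrapper and parses it with the standard json module. The helper name unwrap_jsonp and the sample string are made up for illustration; the real response body is much longer and its fields are not shown here.

```python
import json
import re


def unwrap_jsonp(text):
    """Parse the JSON payload of a JSONP string such as 'callback({...});'."""
    # Match: callback name, "(", payload, ")", optional ";" and whitespace.
    # The greedy (.*) backtracks to the last ")" in the text.
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("text does not look like a JSONP response")
    return json.loads(match.group(1))


# Invented sample standing in for the real response body.
sample = 'return_json({"poll": {"id": "5491", "values": [1, 2]}});'
data = unwrap_jsonp(sample)
print(data["poll"]["id"])
```

With the real URL, the text passed in would be requests.get(url).text instead of the sample string.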


5 Comments

That's what I wrote. The content inside the brackets is the JSON data that you need. And it is valid. I validated it using the service I pointed out.
And I provided a code answer instead of a link. OP shouldn't need to copy that long response manually.
I am not saying you need to copy the JSON response manually into your code. I was just trying to show that it is valid JSON. Just extract the JSON data from the response and do what you need in your code. If you need help handling JSON data in Python, I suggest you read the official docs: docs.python.org/2/library/json.html
I don't need the link. I'm just saying your answer could be better (as in, example code along with the link).
Thanks for the reply. I can now confirm it is a JSON page.
0

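The response is text. It does contain JSON; you just need to extract it:

import json
import requests

# url is the same address used in the question
strip_len = len("return_json(")

# Drop the leading "return_json(" and the two trailing characters that close the call
source = requests.get(url).text[strip_len:-2]
source = json.loads(source)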
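For the DataFrame step the question asks about, one possible approach is to flatten the parsed data into a list of flat dictionaries, one per row. The sketch below is only an illustration: the sample string and every field name in it (entries, results, date, name, value) are invented, so inspect the real parsed response and adjust the keys accordingly. It also locates the parentheses by position rather than by a fixed length, which avoids depending on the callback name.

```python
import json

# Hypothetical response body; the structure and field names are invented.
raw = ('return_json({"entries": [{"date": "2016-01-20", "results": ['
       '{"name": "A", "value": "40.1"}, {"name": "B", "value": "38.5"}]}]});')

# Keep only what sits between the first "(" and the last ")".
start = raw.index("(") + 1
end = raw.rindex(")")
data = json.loads(raw[start:end])

# Build one flat dict per (date, name) pair.
rows = []
for entry in data["entries"]:
    for result in entry["results"]:
        rows.append({
            "date": entry["date"],
            "name": result["name"],
            "value": float(result["value"]),  # numbers arrive as strings in this sample
        })
```

A list of flat dictionaries like rows can be passed straight to pandas.DataFrame(rows), which produces one column per key.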

3 Comments

Thanks for the reply. I tried this method before, but I didn't know I had to strip the 'return_json('. One note: the line that slices the response should be source=requests.get(url).text[strip_len:-2], not -1.
I couldn't see the end of the response, but yes, you should strip that as it isn't part of the JSON.
Basically, that URL returns something that is meant to be queried by JavaScript, not Python. stackoverflow.com/a/7613857/2308683
