python scrape webpage and parse the content

Question

I want to scrape the data at this link:

http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json

I am not sure what type of content this link returns: is it HTML, JSON, or something else? Sorry for my limited web knowledge. I tried the following code to scrape it:

import requests

url='http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json'
source=requests.get(url).text

The type of source is unicode. I also tried urllib2:

import urllib2

source2=urllib2.urlopen(url).read()

The type of source2 is str. I am not sure which method is better, because the link is not like a normal web page containing various tags. If I want to clean the scraped data and build a dataframe from it (such as a pandas DataFrame), what method or process should I follow?

Thanks.

@depperm, thanks for the reply. I updated the link. It should work now. – Mr_Pi, Commented Nov 10, 2016 at 14:25

narko · Accepted Answer · 2016-11-10 14:55:31Z

0

The returned response is text with valid JSON data inside it. You can validate it yourself with a service such as http://jsonlint.com/ if you want. To do so, copy only the content within the parentheses:

return_json("JSON code to copy")

To use that data, you just need to parse it in your program. Here is an example: https://docs.python.org/2/library/json.html
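As a rough illustration of "extract, then parse", here is a minimal sketch. It is written for Python 3, whereas the question uses Python 2. The sample string and the helper name unwrap_jsonp are made up for this example, and the real response is much longer and has different keys.

```python
import json

# Hypothetical stand-in for the response body; not the real payload.
sample = 'return_json({"title": "example", "values": [1, 2, 3]});'

def unwrap_jsonp(text):
    """Parse the JSON found between the first '(' and the last ')'."""
    start = text.index("(") + 1   # first character after the opening parenthesis
    end = text.rindex(")")        # position of the closing parenthesis
    return json.loads(text[start:end])

data = unwrap_jsonp(sample)
print(data["values"])
```

Searching for the parentheses rather than hard-coding offsets means the same helper should keep working if the callback name in the URL changes.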

edited Nov 10, 2016 at 14:55

answered Nov 10, 2016 at 14:33

narko


5 Comments

narko Over a year ago

That's what I wrote. The content inside the brackets is the JSON data that you need. And it is valid. I validated it using the service I pointed out.

OneCricketeer Over a year ago

And I provided a code answer instead of a link. OP shouldn't need to copy that long response manually.

narko Over a year ago

I am not saying you need to copy the JSON response manually into your code. I was just trying to show that it is valid JSON. Just extract the JSON data from the response and do what you need in your code. If you need help handling JSON data in Python, I suggest you read the official docs: docs.python.org/2/library/json.html

OneCricketeer Over a year ago

I don't need the link. I'm just saying your answer could be better (as in example code along with the link)

Mr_Pi Over a year ago

Thanks for the reply. I can now confirm it is a JSON page.

OneCricketeer · Accepted Answer · 2016-11-12 16:44:15Z

0

The response is text. It does contain JSON; you just need to extract it:

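import json
import requests

# url is the same as in the question

# Length of the leading "return_json(" wrapper to cut off
strip_len = len("return_json(")

# Drop the wrapper prefix and the last two characters, then parse
source = requests.get(url).text[strip_len:-2]
source = json.loads(source)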
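The fixed slice above depends on the exact callback name and on the response ending in exactly two extra characters. A slightly more tolerant variant is sketched below for Python 3, using made-up sample strings. The pattern JSONP_RE and the function parse_jsonp are names invented for this sketch, not part of requests or json.

```python
import json
import re

# Matches name( ... ) with an optional trailing semicolon and whitespace.
# re.DOTALL lets the payload span several lines.
JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)

def parse_jsonp(text):
    """Strip a JSONP wrapper of any callback name and parse the payload."""
    match = JSONP_RE.match(text)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))

# Hypothetical response bodies with different endings.
for body in ('return_json({"a": 1});', 'return_json({"a": 1})\n', ' cb([1, 2]) ; '):
    print(parse_jsonp(body))
```

If the parsed result turns out to contain a list of flat dictionaries, that list can be passed to pandas.DataFrame to get one row per dictionary. Whether the real payload has that shape would need to be checked first.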

edited Nov 12, 2016 at 16:44

answered Nov 10, 2016 at 14:43

OneCricketeer


3 Comments

Mr_Pi Over a year ago

Thanks for the reply. I tried this method before, but I didn't know I should strip the 'return_json('. One comment: the correct code for the 3rd line should be source=requests.get(url).text[strip_len:-2], not -1.

OneCricketeer Over a year ago

I couldn't see the end of the response, but yes, you should strip that as it isn't part of the JSON

OneCricketeer Over a year ago

Basically, that URL is returning something that is meant to be queried by javascript, not python. stackoverflow.com/a/7613857/2308683
