Scraping HTML inside JSON with Scrapy

Question

I'm requesting a website whose response is JSON like this:

{
    "success": true,
    "response": "<html>... html goes here ...</html>"
}

I've seen how to scrape either HTML or JSON, but haven't found how to scrape HTML inside JSON. Is it possible to do this using Scrapy?

paul trmbrth · Accepted Answer · 2016-06-14 09:29:42Z

One way is to build a scrapy.Selector out of the HTML inside the JSON data.

I'll assume you have the Response object with JSON data in it, available through response.text.

(Below, I'm building a test response to play with (I'm using scrapy 1.1 with Python 3):

import scrapy

response = scrapy.http.TextResponse(url='http://www.example.com/json', body=r'''
{
    "success": true,
    "response": "<html>\n <head>\n  <base href='http://example.com/' />\n  <title>Example website</title>\n </head>\n <body>\n  <div id='images'>\n   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n  </div>\n </body>\n</html>"
}
''', encoding='utf8')

)

Using the json module, you can get the HTML data like this:

import json
data = json.loads(response.text)

You get something like:

>>> data
{'success': True, 'response': "<html>\n <head>\n  <base href='http://example.com/' />\n  <title>Example website</title>\n </head>\n <body>\n  <div id='images'>\n   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n  </div>\n </body>\n</html>"}

Then you can build a new selector like this:

selector = scrapy.Selector(text=data['response'], type="html")

after which you can use XPath or CSS selectors on it:

>>> selector.xpath('//title/text()').extract()
['Example website']

lanx86 · Accepted Answer · 2017-01-01 05:18:50Z

There's another way that doesn't require constructing a response object: you can use lxml to parse the HTML text directly. You don't need to install any new library, since Scrapy's Selector is based on lxml. Just add the line below to import lxml.

from lxml import etree

Here is an example, assuming that the JSON response is:

{
    "success": true,
    "htmlinjson": "<html><body> <p id='p1'>p111111</p> <p id='p2'>p22222</p> </html>"
}

Extract the HTML text from the JSON response with:

import json

htmlText = json.loads(response.text)['htmlinjson']

Then parse it into an lxml element, which supports XPath queries, using:

from lxml import etree

resultPage = etree.HTML(htmlText)

Now use that element to extract the text of the node with id="p1", using XPath just as you would with a Scrapy XPath selector:

print(resultPage.xpath('//p[@id="p1"]')[0].text)

You will get:

p111111
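As a self-contained sketch of the same lxml approach, the snippet below collects the text of every p element that has an id, not just p1. The raw string stands in for response.text and is an assumption made for the example.

```python
import json

from lxml import etree

# Stand-in for response.text, built from the sample JSON above.
raw = json.dumps({
    "success": True,
    "htmlinjson": "<html><body> <p id='p1'>p111111</p> <p id='p2'>p22222</p> </html>",
})

html_text = json.loads(raw)["htmlinjson"]
root = etree.HTML(html_text)  # lxml's HTML parser tolerates the missing </body>

# Map each paragraph's id attribute to its text content.
paragraphs = {p.get("id"): p.text for p in root.xpath("//p[@id]")}
print(paragraphs)
```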

Hope that helps :)

Vinicius de Castro · Accepted Answer · 2016-06-14 18:06:06Z

You can try json.loads(initial_response), which gives you a dict whose keys you can then use, such as ['response'].
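A small sketch of that idea using only the standard library. The raw string is a hypothetical response body, and checking the "success" flag before reading the HTML is an extra precaution, not something the site is known to require.

```python
import json

# Hypothetical response body, shaped like the JSON in the question.
raw = '{"success": true, "response": "<html><body><p>hello</p></body></html>"}'

data = json.loads(raw)  # a plain dict

# Only read the HTML when the payload reports success.
html = data["response"] if data.get("success") else None
print(type(data).__name__)
print(html)
```

The html string can then be handed to whichever HTML parser you prefer.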
