7

I'm requesting a website whose response is JSON like this:

{
    "success": true,
    "response": "<html>... html goes here ...</html>"
}

I've seen how to scrape HTML and how to scrape JSON, but haven't found how to scrape HTML embedded inside JSON. Is it possible to do this using Scrapy?

3 Answers

15

One way is to build a scrapy.Selector out of the HTML inside the JSON data.

I'll assume you have the Response object with JSON data in it, available through response.text.

Below, I'm building a test response to play with (I'm using Scrapy 1.1 with Python 3):

import scrapy

response = scrapy.http.TextResponse(url='http://www.example.com/json', body=r'''
{
    "success": true,
    "response": "<html>\n <head>\n  <base href='http://example.com/' />\n  <title>Example website</title>\n </head>\n <body>\n  <div id='images'>\n   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n  </div>\n </body>\n</html>"
}
''', encoding='utf8')

Using the json module, you can get the HTML data like this:

import json
data = json.loads(response.text)

You get something like this:

>>> data
{'success': True, 'response': "<html>\n <head>\n  <base href='http://example.com/' />\n  <title>Example website</title>\n </head>\n <body>\n  <div id='images'>\n   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n  </div>\n </body>\n</html>"}

Then you can build a new selector like this:

selector = scrapy.Selector(text=data['response'], type="html")

After that, you can use XPath or CSS selectors on it:

>>> selector.xpath('//title/text()').extract()
['Example website']

1

There's another way that doesn't require constructing a response object at all: you can use lxml to parse the HTML text directly. You don't need to install any new library, since Scrapy's Selector is built on lxml. Just add the line below to import it.

from lxml import etree

Here is an example, assuming that the JSON response is:

{
    "success": true,
    "htmlinjson": "<html><body> <p id='p1'>p111111</p> <p id='p2'>p22222</p> </html>"
}

Extract the HTML text from the JSON response with:

import json

htmlText = json.loads(response.text)['htmlinjson']

Then construct an lxml element tree, which supports XPath queries, using:

from lxml import etree

resultPage = etree.HTML(htmlText)

Now use the lxml tree to extract the text of the node with id="p1", using XPath just as you would with a Scrapy XPath selector:

print(resultPage.xpath('//p[@id="p1"]')[0].text)

You will get:

p111111
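
To pull more than one node at once, the same lxml approach can be extended. The following is only a sketch, reusing the sample JSON from this answer; the names `raw`, `root`, and `texts` are illustrative:

```python
import json

from lxml import etree

# Sample payload from this answer, serialized so the example is self-contained.
raw = json.dumps({
    "success": True,
    "htmlinjson": "<html><body> <p id='p1'>p111111</p> <p id='p2'>p22222</p> </html>",
})

html_text = json.loads(raw)["htmlinjson"]

# etree.HTML uses a forgiving HTML parser, so the missing </body> is tolerated.
root = etree.HTML(html_text)

# Map each paragraph's id attribute to its text content.
texts = {p.get("id"): p.text for p in root.xpath("//p[@id]")}
print(texts)
```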

Hope that helps :)


0

You can try json.loads(initial_response) (passing the response body as a string), which gives you a dict whose keys you can then use, such as ['response'].
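
A minimal sketch of that idea using only the standard library. The helper name `extract_html` is hypothetical, and treating a falsy "success" flag as "no HTML available" is an assumption about the API, not something the question confirms:

```python
import json


def extract_html(body, key="response"):
    """Return the HTML string stored under `key`, or None if it is unavailable."""
    try:
        data = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    # Assumption: a falsy "success" flag means the payload carries no usable HTML.
    if not isinstance(data, dict) or not data.get("success"):
        return None
    html = data.get(key)
    return html if isinstance(html, str) else None


body = '{"success": true, "response": "<html><body><p>hi</p></body></html>"}'
print(extract_html(body))
```

The returned string can then be handed to whichever HTML parser you prefer.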
