Processing JSON Response using scrapy

Question

I have the following code in my scrapy spider:

def parse(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    htmldata = jsonresponse["html"]
    for sel in htmldata.xpath('//li/li'):
        # -- more xpath codes --
    yield item

But I am getting this error:

    raise ValueError("No JSON object could be decoded")
exceptions.ValueError: No JSON object could be decoded

After checking the JSON reply, I found that the <!--WPJM--> and <!--WPJM_END--> comments wrapped around it are causing this error.

<!--WPJM-->{"found_jobs":true,"html":"<html code>","max_num_pages":3}<!--WPJM_END-->

How do I parse the response in Scrapy while ignoring the <!--WPJM--> and <!--WPJM_END--> comments?

EDIT: This is the error that I have:

  File "/home/muhammad/Projects/project/project/spiders/crawler.py", line 150, in parse
    for sel in htmldata.xpath('//li'):
exceptions.AttributeError: 'unicode' object has no attribute 'xpath'

    def parse(self, response):
        rawdata = response.body_as_unicode()
        jsondata = rawdata.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
        # print jsondata  # For debugging
        data = json.loads(jsondata)
        htmldata = data["html"]
        # print htmldata  # For debugging
        for sel in htmldata.xpath('//li'):
            item = ProjectjomkerjaItem()
            item['title'] = sel.xpath('a/div[@class="position"]/div[@id="job-title-job-listing"]/strong/text()').extract()
            item['company'] = sel.xpath('a/div[@class="position"]/div[@class="company"]/strong/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()

Content-Type: application/x-www-form-urlencoded

– muhammadn, Commented Dec 12, 2014 at 15:17

alecxe · Accepted Answer · 2014-12-12 14:58:06Z

1

The easiest approach would be to get rid of the comment tags manually using replace():

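# Inside parse(); requires "import json" at the top of the spider module.
data = response.body_as_unicode()  # newer Scrapy versions expose this as response.text
data = data.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
jsonresponse = json.loads(data)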

Though it is neither very Pythonic nor very reliable.
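If the chained replace() calls feel fragile, one alternative is to pull out whatever sits between the two marker comments with a regular expression and decode only that. The following is a minimal sketch, not part of the original answer: the helper name extract_wpjm_json and the sample string are made up for illustration, and it assumes the markers appear at most once in the body.

```python
import json
import re

# Matches the text between the two marker comments; the marker names
# come from the response shown in the question.
WPJM_PATTERN = re.compile(r'<!--WPJM-->(.*?)<!--WPJM_END-->', re.DOTALL)


def extract_wpjm_json(body):
    """Return the decoded JSON found between the WPJM markers.

    Falls back to decoding the whole body when no markers are present.
    (Hypothetical helper, for illustration only.)
    """
    match = WPJM_PATTERN.search(body)
    payload = match.group(1) if match else body
    return json.loads(payload)


# Made-up sample shaped like the response in the question.
sample = '<!--WPJM-->{"found_jobs":true,"html":"<ul></ul>","max_num_pages":3}<!--WPJM_END-->'
data = extract_wpjm_json(sample)
```

In a spider, the body string passed in would be response.body_as_unicode() (or response.text on newer Scrapy versions).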

Or, a better option would be to get the text() by XPath:

$ scrapy shell index.html
>>> response.xpath('//text()').extract()[0]
u'{"found_jobs":true,"html":"<html code"}'

edited Dec 12, 2014 at 14:58

answered Dec 12, 2014 at 14:51


6 Comments

muhammadn Over a year ago

Could you explain your second option? It was not clear to me (the first one is). My response ends with class=\"rss_link">RSS<\/a>","max_num_pages":3}

alecxe Over a year ago

@muhammadn the second option is to use text() xpath function to extract the text from the HTML code; in other words, to ignore comments.

muhammadn Over a year ago

I am still getting the JSON decode error, and I am really unsure why. (But when I save the JSON file, remove the WPJM comments, run a JSON decoder in Ruby to turn it into HTML, and then run Scrapy against file://processedjsonfile.html, it works.)

alecxe Over a year ago

@muhammadn What is the complete error message? And how does the data look after extracting it and getting rid of the comments?

muhammadn Over a year ago

File "/home/muhammad/Projects/project/project/spiders/spider.py", line 150, in parse for sel in htmldata.xpath('//li'): exceptions.AttributeError: 'unicode' object has no attribute 'xpath' (I have also edited my question.) htmldata contains the HTML data, but the for loop needs a response-like object to call xpath() on, and I don't know how to create one.
