I have the following code in my scrapy spider:

    def parse(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        htmldata = jsonresponse["html"]
        for sel in htmldata.xpath('//li/li'):
            # ... more XPath code ...
        yield item

But I am getting this error:

    raise ValueError("No JSON object could be decoded")
exceptions.ValueError: No JSON object could be decoded

After checking the JSON reply, I found that it is wrapped in **<!--WPJM-->** and **<!--WPJM_END-->** markers, which are causing this error.

<!--WPJM-->{"found_jobs":true,"html":"<html code>","max_num_pages":3}<!--WPJM_END-->

How do I parse the response in Scrapy while ignoring the <!--WPJM--> and <!--WPJM_END--> markers?

EDIT: This is the error I get now:

    File "/home/muhammad/Projects/project/project/spiders/crawler.py", line 150, in parse
      for sel in htmldata.xpath('//li'):
    exceptions.AttributeError: 'unicode' object has no attribute 'xpath'

    def parse(self, response):
        rawdata = response.body_as_unicode()
        jsondata = rawdata.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
#       print jsondata # For debugging
#       pass 
        data = json.loads(jsondata)
        htmldata = data["html"]
#       print htmldata # For debugging
#       pass
        for sel in htmldata.xpath('//li'):
           item = ProjectjomkerjaItem()
           item['title'] = sel.xpath('a/div[@class="position"]/div[@id="job-title-job-listing"]/strong/text()').extract()
           item['company'] = sel.xpath('a/div[@class="position"]/div[@class="company"]/strong/text()').extract()
           item['link'] = sel.xpath('a/@href').extract()
  • Content-Type: application/x-www-form-urlencoded Commented Dec 12, 2014 at 15:17

1 Answer

The easiest approach would be to get rid of the comment tags manually using replace():

data = response.body_as_unicode()
data = data.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
jsonresponse = json.loads(data)

Though it is neither very Pythonic nor very reliable.
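A slightly more tolerant variant of the replace() idea, as a sketch only: a regular expression removes both markers in one pass and also copes with stray whitespace inside the comments. The names WPJM_MARKER and load_wpjm_json are made up for this example, and the sample string is the one from the question.

```python
import json
import re

# Matches <!--WPJM--> and <!--WPJM_END-->, tolerating whitespace inside the comment.
WPJM_MARKER = re.compile(r'<!--\s*WPJM(?:_END)?\s*-->')


def load_wpjm_json(raw_body):
    """Remove the WPJM comment markers and decode what is left as JSON."""
    return json.loads(WPJM_MARKER.sub('', raw_body))


# Sample body copied from the question.
sample = '<!--WPJM-->{"found_jobs":true,"html":"<html code>","max_num_pages":3}<!--WPJM_END-->'
data = load_wpjm_json(sample)
print(data["max_num_pages"])
```

In a spider, the argument would be the decoded response body (response.body_as_unicode() in the Scrapy version used in the question).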

Or, a better option would be to get the text() by XPath:

$ scrapy shell index.html
>>> response.xpath('//text()').extract()[0]
u'{"found_jobs":true,"html":"<html code"}'

6 Comments

Could you explain the second option? It was not clear to me (the first one is). My response ends with class=\"rss_link">RSS<\/a>","max_num_pages":3}<!--WPJM_END-->
@muhammadn The second option is to use the text() XPath function to extract the text from the HTML; in other words, to ignore the comments.
I am still getting the JSON decode error, and I am really unsure why. (But when I save the JSON to a file, remove the WPJM comments, convert the JSON to HTML with a JSON decoder in Ruby, and run scrapy against file://processedjsonfile.html, it works.)
@muhammadn What is the complete error message? And how does the data look after extracting it and getting rid of the comments?
File "/home/muhammad/Projects/project/project/spiders/spider.py", line 150, in parse for sel in htmldata.xpath('//li'): exceptions.AttributeError: 'unicode' object has no attribute 'xpath' (I have also edited my question). htmldata contains the HTML, but the for loop needs a response object to call xpath() on, and I don't know how to create one.