Python + Scrapy + JSON + XPath : How to scrape JSON data with Scrapy

Question

I know how to write XPaths for HTML data points with Scrapy. But I need to scrape all the event URLs (the starting URLs) on this page of the site, and they are embedded in the page in JSON format:

https://highape.com/bangalore/all-events

view-source:https://highape.com/bangalore/all-events

I usually write this in this format:

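from scrapy import Request

def parse(self, response):
    # The XPath placeholder below is the part this question asks about
    events = response.xpath('**What To Write Here?**').extract()

    for event in events:
        # Resolve relative links against the page URL
        absolute_url = response.urljoin(event)
        yield Request(absolute_url, callback=self.parse_event)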

Please tell me what I should write in the 'What To Write Here?' portion.

Sohan Das · Accepted Answer · 2018-10-12 16:51:19Z

2

View the page source of the URL. The JSON data spans lines 76 to 9045; you can copy it into a data.json file on your local drive, or fetch and parse it directly from the page as this code does:

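import json

import requests
from bs4 import BeautifulSoup

# Fetch the listing page and parse its HTML
req = requests.get('https://highape.com/bangalore/all-events')
soup = BeautifulSoup(req.content, 'html.parser')

# Take the sixth <script> tag (position-based, so this breaks if the page layout changes)
js = soup.find_all('script')[5].text

# strict=False lets the parser accept raw control characters (such as line breaks) inside strings
data = json.loads(js, strict=False)
for i in data:
    url = i['url']
    print(url)
    # In a Scrapy spider, yield a Request for this URL here instead of printing it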
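The "callback with scrapy" step in the loop above could be sketched as follows. This is a minimal illustration, not the answerer's code: `event_urls` is a hypothetical helper, the sample JSON is made up to stand in for the page's embedded data, and it assumes the JSON is an array of objects that each carry a `url` key, as the loop implies. It uses only the standard library so it runs without Scrapy installed.

```python
import json
from urllib.parse import urljoin


def event_urls(raw_json, base_url):
    """Return absolute event URLs from a JSON array of objects with a "url" key."""
    # strict=False mirrors the answer: it tolerates raw control characters in strings
    events = json.loads(raw_json, strict=False)
    # urljoin resolves relative paths and leaves absolute URLs unchanged
    return [urljoin(base_url, event["url"]) for event in events if "url" in event]


# Hypothetical sample standing in for the JSON embedded in the page
sample = (
    '[{"name": "Event A", "url": "/bangalore/events/a"},'
    ' {"name": "Event B", "url": "https://highape.com/bangalore/events/b"}]'
)
urls = event_urls(sample, "https://highape.com/bangalore/all-events")
print(urls)
```

Inside a spider's `parse` method, the same idea would be to loop over the returned URLs and `yield Request(url, callback=self.parse_event)`, passing `response.url` as the base URL.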

edited Oct 12, 2018 at 16:51

answered Oct 12, 2018 at 13:44


3 Comments

Debbie Over a year ago

Hi, your solution worked. But as you can see, the URL was for Bangalore only: highape.com/bangalore/all-events. Even for Bangalore alone I have to maintain a big file on my machine. New events keep being added and old events removed, so I would have to update that file every day. For all cities, maintaining big local files is practically impossible, so your solution is impractical. Could you please suggest something else?

Sohan Das Over a year ago

Answer updated! If you like it, please upvote and accept.

Debbie Over a year ago

Sure. I just need some time to check if the answer works for me.

nosklo · Accepted Answer · 2018-10-13 18:39:59Z

0

What to write here?

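import json
# /text() selects the script body without the <script> tags; extract_first() returns a single string
events = response.xpath("//script[@type='application/ld+json']/text()").extract_first()
events = json.loads(events, strict=False)  # strict=False tolerates raw \r and \n inside strings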

edited Oct 13, 2018 at 18:39

answered Oct 12, 2018 at 17:42


3 Comments

Debbie Over a year ago

response.xpath("//script[@type='application/ld+json']").extract() fetches lines 75 to 9046 of the view-source page. This line, events = json.loads(events), gives this error: TypeError: expected string or buffer. If I modify the second line and write for event in events: event1 = json.loads(event), I get this error: ValueError: No JSON object could be decoded

nosklo Over a year ago

@Debbie looks like we have to call str() on it? Edited my answer

pwinz Over a year ago

events is a list because extract() returns a list. In this case, two elements are retrieved with that XPath. If you call events = response.xpath("//script[@type='application/ld+json']/text()").extract_first() you'll get the desired data. You'll still get a JSONDecodeError because there are literal \r and \n characters in the data (the first example is right after "All these games are developed by highly skilled developers who ensure that the"). See here for help
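The two failures described in these comments can be reproduced in isolation. The snippet below is a small sketch with a made-up JSON string (not the site's real data) that contains a raw line break inside a string value. It shows that passing a list raises TypeError, that default strict parsing rejects the control characters, and that `strict=False` accepts them. On Python 3 the second error is `json.JSONDecodeError`, a subclass of the ValueError reported in the thread.

```python
import json

# Hypothetical stand-in for the <script> body; "\r\n" here becomes a real
# carriage return and line feed inside the JSON string value
raw = '[{"url": "/bangalore/events/a", "description": "first line\r\nsecond line"}]'

# extract() returns a list, and json.loads does not accept a list
try:
    json.loads([raw])
    list_error = None
except TypeError as exc:
    list_error = type(exc).__name__

# Default (strict) parsing rejects raw control characters inside strings
try:
    json.loads(raw)
    strict_error = None
except json.JSONDecodeError as exc:
    strict_error = type(exc).__name__

# strict=False allows control characters inside strings
events = json.loads(raw, strict=False)
print(list_error, strict_error, events[0]["url"])
```

An alternative to `strict=False` would be to clean the control characters out of the text before parsing, but that depends on how the page escapes its data, so it is left out here.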
