I know how to fetch XPaths for HTML data points with Scrapy. But I have to scrape all the URLs (the starting URLs) on this page of the site, and they are embedded in JSON format:

https://highape.com/bangalore/all-events

view-source:https://highape.com/bangalore/all-events

I usually write the parse method like this:

def parse(self, response):
    events = response.xpath('**What To Write Here?**').extract()

    for event in events:
        absolute_url = response.urljoin(event)
        yield Request(absolute_url, callback=self.parse_event)

Please tell me what I should write in the 'What To Write Here?' portion.


2 Answers

View the page source of the URL: the JSON data occupies lines 76 to 9045. Instead of copying those lines into a local data.json file, you can fetch the page and parse the embedded script directly with this code:

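import json

import requests
from bs4 import BeautifulSoup

# Fetch the listing page and parse its HTML
req = requests.get('https://highape.com/bangalore/all-events')
soup = BeautifulSoup(req.content, 'html.parser')
# Take the text of the sixth <script> tag on the page (index 5)
js = soup.find_all('script')[5].text
# strict=False tolerates literal control characters inside the JSON strings
data = json.loads(js, strict=False)
for i in data:
    url = i['url']
    print(url)
    # issue the Scrapy callback for this URL here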

3 Comments

Hi, your solution worked. But as you can see, the URL was for Bangalore only: highape.com/bangalore/all-events. For Bangalore alone I am maintaining a big file on my machine. New events keep being added and old events removed, so I have to update the content of that file every day. For all cities it is practically impossible to maintain big local files, so your solution is impractical. Could you please suggest something else?
Answer updated! If you like it, please upvote and accept.
Sure. I just need some time to check if the answer works for me.

What to write here?

events = response.xpath("//script[@type='application/ld+json']").extract()
events = json.loads(events[0])

3 Comments

response.xpath("//script[@type='application/ld+json']").extract() fetches lines 75 to 9046 of the view-source page. The line events = json.loads(events) gives this error: TypeError: expected string or buffer. If I change the second line to a loop, for event in events: event1 = json.loads(event), I get this error: ValueError: No JSON object could be decoded
@Debbie Looks like we have to call str() on it? I edited my answer.
events is a list because extract() returns a list. In this case two elements are retrieved with that XPath. If you call events = response.xpath("//script[@type='application/ld+json']/text()").extract_first() you'll get the desired data. You'll still get a JSONDecodeError because there are literal \r and \n characters in the data (the first example is right after "All these games are developed by highly skilled developers who ensure that the"). See here for help
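
Putting the comments above together, a minimal sketch of the decoding step could look like this. It assumes the script text is a JSON list of event objects that each carry a 'url' key, as the loop in the first answer suggests. The helper name event_urls_from_ld_json and the sample data are hypothetical, made up for illustration only.

```python
import json


def event_urls_from_ld_json(raw):
    """Return the 'url' value of every event object in a JSON string."""
    # strict=False lets the decoder accept literal control characters
    # (such as \r and \n) inside JSON strings instead of raising an error.
    data = json.loads(raw, strict=False)
    # Assumption: the payload is a list of event objects; wrap a lone object.
    if isinstance(data, dict):
        data = [data]
    return [item["url"] for item in data
            if isinstance(item, dict) and "url" in item]


# Hypothetical sample: the description holds a literal CR/LF pair,
# which the default strict decoder would reject.
sample = ('[{"name": "Game night", "description": "line one\r\nline two", '
          '"url": "https://example.com/events/1"}]')
print(event_urls_from_ld_json(sample))
```

Inside parse, the string returned by response.xpath("//script[@type='application/ld+json']/text()").extract_first() could be passed to this helper, and each resulting URL handed to response.urljoin and Request as in the question.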
