I'm using Scrapy and trying to test my selector in the Scrapy shell, but nothing is working. I'm trying to scrape the JSON data on this website:

https://web.archive.org/web/20180604230058/https://api.simon.com/v1.2/tenant?mallId=231&key=40A6F8C3-3678-410D-86A5-BAEE2804C8F2&lw=true

I've tried to scrape the data using this selector:

    response.css("body > pre::text").extract()

However, this doesn't seem to be working. Not sure what's wrong...

Ideally, I just want to get all the "Name: XXX" elements from the JSON data, so if you know how to select those specifically, that would be very helpful as well!

Currently my code looks like this:

    # -*- coding: utf-8 -*-
    import scrapy  # needed to scrape
    import sys     # needed to add the directory that contains xlrd
    # append (not extend) so the whole path is added as a single entry
    sys.path.append("/Users/YoungFreeesh/anaconda3/lib/python3.6/site-packages/")
    import xlrd    # used to easily import xlsx file

    class AmazonbotSpider(scrapy.Spider):
        name = 'ArchiveSpider'

        allowed_domains = ['web.archive.org']
        start_urls =['https://web.archive.org/web/20180604230058/https://api.simon.com/v1.2/tenant?mallId=231&key=40A6F8C3-3678-410D-86A5-BAEE2804C8F2&lw=true']

        def parse(self, response):
            print(response.body)
  • Re: "this doesn't seem to be working" — not sure anyone is a mind reader here. I could be wrong though... Commented Jun 11, 2018 at 20:16
  • I checked the network log, and the page loads the JSON file from this URL: web.archive.org/web/20180604230058if_/https://api.simon.com/… The only difference between the two URLs is the 'if_' after the timestamp. See if this pattern holds for the other links you have; if so, you can use this hack to get your data. Commented Jun 11, 2018 at 20:19
  • @SP_ Thanks! That works. Commented Jun 11, 2018 at 20:53
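
The 'if_' trick from the comment above can be sketched as a small helper. This is only an illustration: the function name `to_raw_wayback_url` is hypothetical, and it assumes the archived URL follows the `/web/<timestamp>/<original URL>` shape seen in the question.

```python
import re


def to_raw_wayback_url(url):
    """Insert the 'if_' flag right after the Wayback Machine timestamp."""
    # Rewrites '/web/<digits>/' to '/web/<digits>if_/'.
    # A URL that already has a flag after the timestamp does not match
    # the pattern and is returned unchanged.
    return re.sub(r"(/web/\d+)/", r"\1if_/", url, count=1)


page_url = (
    "https://web.archive.org/web/20180604230058/"
    "https://api.simon.com/v1.2/tenant"
    "?mallId=231&key=40A6F8C3-3678-410D-86A5-BAEE2804C8F2&lw=true"
)
raw_url = to_raw_wayback_url(page_url)
print(raw_url)
```

The rewritten URL can then be used in `start_urls` in place of the original one, so the spider receives the JSON directly instead of the wrapper page.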

1 Answer

The content is inside an iframe, which is a separate page, so you have to request the iframe's URL first, just as you would follow a link. Something like this:

urls = response.css('iframe::attr(src)').extract()
for url in urls:
    # urljoin resolves a relative src; the keyword is callback, not target
    yield scrapy.Request(response.urljoin(url), callback=self.parse_iframe)

Then define a new parse_iframe method where you parse the iframe's response.
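
For the "Name" values the question asks about, here is a minimal sketch of what the body of parse_iframe could do once it has the JSON text. The layout of the real API response is not shown in the question, so the helper walks the whole structure instead of assuming one; `collect_values` and the sample payload (including the "Tenants" and "Info" keys) are made up for illustration.

```python
import json


def collect_values(node, key="Name"):
    """Recursively gather every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending in case the key also appears deeper down.
            found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Hypothetical payload standing in for the real API response.
sample_text = (
    '{"Tenants": [{"Name": "Store A"},'
    ' {"Name": "Store B", "Info": {"Name": "Nested"}}]}'
)
names = collect_values(json.loads(sample_text))
print(names)
```

Inside the spider, the same call would be `collect_values(json.loads(response.text))`, assuming the response body is plain JSON.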

1 Comment

Here is a similar question: stackoverflow.com/questions/52779161/… Could you please answer?
