(Python) Scrapy - How to scrape a JS dropdown list?

Question

I want to scrape the JavaScript drop-down list in the 'size' section of this page:

http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119

I want to get the sizes that are in stock, returned as a list. How can I do that?

Here's my full code:

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request

class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):       
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes) 

    def parse_shoes(self, response):
        name = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        price = response.xpath('//*[@itemprop="price"]/text()').extract_first()
        #sizes = ??

        yield {
            'name' : name,
            'price' : price,
            'sizes' : sizes
        }

Thanks

GoTrained · Accepted Answer · 2017-03-06 06:02:54Z

Here is the code to extract sizes in stock.

import scrapy


class ShoesSpider(scrapy.Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

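    def parse(self, response):
        # Select every <option> of the size drop-down.
        sizes = response.xpath('//*[@class="nsg-form--drop-down exp-pdp-size-dropdown exp-pdp-dropdown two-column-dropdown"]/option')

        for s in sizes:
            # The predicate drops the text of options marked as out of stock.
            size = s.xpath('text()[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]').extract_first('').strip()
            # Excluded options produce an empty string, so skip them.
            if size:
                yield {'Size': size}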

Here is the result:

M 4 / W 5.5
M 4.5 / W 6
M 6.5 / W 8
M 7 / W 8.5
M 7.5 / W 9
M 8 / W 9.5
M 8.5 / W 10
M 9 / W 10.5

If the line inside the for loop is written like this instead, it extracts all the sizes, whether they are in stock or not.

size = s.xpath('text()').extract_first('').strip()

To get only the sizes that are in stock, you have to exclude the out-of-stock options, which are marked with the class "exp-pdp-size-not-in-stock selectBox-disabled", by adding this predicate:

[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]

I have tested it on other shoe pages, and it works as well.

Umair Ayub · Accepted Answer · 2017-03-04 15:16:13Z

Sizes are being loaded by an AJAX call.

So you will have to make another request to that AJAX URL in order to scrape Sizes.

Here is fully working code. (I have not run code on my side but I am sure its working)

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request
import json

class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):       
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes) 

    def parse_shoes(self, response):
        # Collect the fields that are available on the product page itself.
        data = {}
        data['name'] = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        data['price'] = response.xpath('//*[@itemprop="price"]/text()').extract_first()

        # Sizes come from a separate AJAX endpoint; pass the partial item along in meta.
        sizes_url = "http://store.nike.com/html-services/templateData/pdpData?action=getPage&path=%2Fus%2Fen_us%2Fpd%2Fmagista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat%2Fpid-11229710%2Fpgid-11918119&productId=11229710&productGroupId=11918119&catalogId=100701&cache=true&country=US&lang_locale=en_US"
        yield Request(url=sizes_url, callback=self.parse_sizes, meta={'data': data})

    def parse_sizes(self, response):
        # The AJAX endpoint returns JSON.
        resp = json.loads(response.body)

        # Recover the fields scraped from the product page.
        data = response.meta['data']

        sizes = resp['response']['pdpData']['skuContainer']['productSkus']

        # Keep only the display label of each SKU.
        sizes_array = [sku["displaySize"] for sku in sizes]

        yield {
            'name': data['name'],
            'price': data['price'],
            'sizes': sizes_array,
        }
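
The JSON-parsing step in the callback above can be tried offline. The following is a rough sketch that uses only the key path named in the answer (response, pdpData, skuContainer, productSkus, displaySize); the sample payload is invented for illustration, and the real response almost certainly contains more fields than this.

```python
import json

# Hypothetical payload: only the keys mentioned in the answer are included.
SAMPLE_BODY = json.dumps({
    "response": {
        "pdpData": {
            "skuContainer": {
                "productSkus": [
                    {"displaySize": "M 4 / W 5.5"},
                    {"displaySize": "M 4.5 / W 6"},
                ]
            }
        }
    }
}).encode("utf-8")


def extract_display_sizes(body):
    """Return the displaySize of every SKU, or an empty list if keys are missing."""
    payload = json.loads(body)
    skus = (
        payload.get("response", {})
        .get("pdpData", {})
        .get("skuContainer", {})
        .get("productSkus", [])
    )
    return [sku["displaySize"] for sku in skus if "displaySize" in sku]


if __name__ == "__main__":
    print(extract_display_sizes(SAMPLE_BODY))
```

Using .get() with defaults is a design choice for this sketch: a product whose response lacks one of the keys then produces an empty list instead of raising KeyError inside the spider.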

NOTE:

The sizes_url will be different for each product, so you will have to spend some time to see what parameters it takes.
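
Judging only from the example sizes_url above, the query string appears to repeat the page path plus the numbers after pid- and pgid- in the product URL. Here is a minimal sketch of a helper that rebuilds it on that assumption. The function name build_sizes_url is hypothetical, and the default catalogId, country and lang_locale values are simply copied from the example; they may well differ for other products or regions.

```python
import re
from urllib.parse import urlencode, urlparse

# Endpoint taken from the example URL in the answer.
PDP_DATA_ENDPOINT = "http://store.nike.com/html-services/templateData/pdpData"


def build_sizes_url(product_url, catalog_id="100701", country="US", lang_locale="en_US"):
    """Build the AJAX URL for a product page (assumes the pid-/pgid- URL layout)."""
    path = urlparse(product_url).path
    match = re.search(r"/pid-(\d+)/pgid-(\d+)", path)
    if match is None:
        raise ValueError("no pid-/pgid- segments found in %r" % product_url)
    product_id, product_group_id = match.groups()

    # Parameter order mirrors the example URL; urlencode percent-encodes the path.
    query = urlencode({
        "action": "getPage",
        "path": path,
        "productId": product_id,
        "productGroupId": product_group_id,
        "catalogId": catalog_id,  # assumption: copied from the example
        "cache": "true",
        "country": country,
        "lang_locale": lang_locale,
    })
    return PDP_DATA_ENDPOINT + "?" + query


if __name__ == "__main__":
    print(build_sizes_url(
        "http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119"
    ))
```

Inside the spider, parse_shoes could then call build_sizes_url(response.url) instead of hard-coding the URL, provided the assumption about the parameters holds for the products being scraped.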

3 Comments

Lightness Races in Orbit Over a year ago

"(I have not run code on my side but I am sure its working)" This is a non sequitur.

Umair Ayub Over a year ago

@LightnessRacesinOrbit I have been a Python programmer for the past 4 years, and the only thing I added to the code is an additional request to that AJAX URL ... So the basic motive was to tell the OP that the sizes are not loaded on the product page itself; they are loaded by an additional AJAX call. Thanks.

Lightness Races in Orbit Over a year ago

I believe you missed my important point. And, by the way, four years is not a very long time.
