Scraping data from an HTTP & JavaScript site

Question

I currently want to scrape some data from an Amazon page and I'm kind of stuck.

For example, let's take this page.

https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1

I want to scrape every variant of shoe size and color. That data can be found by opening the page source and searching for 'variationValues'.

There we can see a sort of dictionary containing all the sizes and colors and, below that, in 'asinToDimensionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.

For example, in asinToDimensionIndexMap we can see

"B01KWIUH5M":[0,0]

Which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in variationValues size_name section) and the color 'Teal' (same idea as before)

I want to scrape both the variationValues and the asinToDimensionIndexMap, so I can associate the index map numbers with the entries in variationValues.

Another person on the site (thanks for the help, btw) suggested doing it this way.

import json
import re

script = response.xpath('//script/text()').extract_first()
# capture everything between {}
data = re.findall(r'(\{.+?\})', script)

d = json.loads(data[0])
d['products'][0]

I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of JSON is not that great, and reading about it didn't help much.

Is there a way to get, from that data, two dictionaries or lists with the variationValues and asinToDimensionIndexMap (maybe using some regular expressions in the middle to pull the data out of a big string)? Or could someone explain a little of what happens in the JSON part?

Thanks for the help!

EDIT: Added photo of variationValues and asinToDimensionIndexMap

Running d['products'][0] doesn't work on the Amazon site. You'll need to look more closely at the specific structure for your case. I've posted an answer below where you can see a visual representation of the Amazon site's JSON structure.
– Daniel Scott, Commented Jan 6, 2019 at 0:41
"variationValues" : {"size_name":["8 M US","8.5 M US","9.5 M US","10 M US","10.5 M US"],"color_name":["Teal"]}
"dimensionValuesData" : [["8 M US","8.5 M US","9.5 M US","10 M US","10.5 M US"],["Teal"]]
What do you actually need from the web?
– ThunderMind, Commented Jan 7, 2019 at 17:58

Daniel Scott · Accepted Answer · 2019-01-06 00:34:37Z

1

I think you are close, Manuel!

The following code will turn your scraped source into easy-to-select boxes:

import json
d = json.loads(data[0])

JSON is a text format for storing structured data, independent of platform and programming language. json.loads parses a JSON string into Python objects: JSON objects become dicts and JSON arrays become lists, which you can then index like any other dict or list.
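As a minimal sketch of what json.loads does, the snippet below parses a small JSON string shaped like the variationValues fragment quoted in the comments. The string literal is shortened sample data, not the full page source.

```python
import json

# Sample JSON text, shaped like the variationValues fragment from the page.
raw = '{"size_name": ["8 M US", "8.5 M US"], "color_name": ["Teal"]}'

# json.loads turns JSON text into plain Python objects:
# JSON objects become dicts, JSON arrays become lists.
parsed = json.loads(raw)

first_size = parsed["size_name"][0]   # index into the list under "size_name"
colors = parsed["color_name"]         # the whole list under "color_name"
print(type(parsed).__name__, first_size, colors)
```

If the string is not valid JSON (for example, a regex captured only part of an object), json.loads raises json.JSONDecodeError, which is usually the first thing to check when this step fails.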

https://www.w3schools.com/js/js_json_intro.asp

I'm assuming the challenge you may be running into is errors when accessing a particular "box" inside your JSON object.

Your code format looks correct, but your access within "each box" may look different.

E.g., if your 'asinToDimensionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):

d['products'][0]['asinToDimensionIndexMap']

I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side, you will see which boxes are within one another, which is precisely what you need to know to access what you need.

JSON Object Viewer

For example, the following would yield "companyCompliancePolicies_feature_div":

import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']

The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to find what you're looking for.
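If you'd rather inspect the nesting from Python than from a viewer, something like the following sketch can help. The describe helper is a made-up name for illustration, and the sample string only mirrors the updateDivLists path used above; the real page data is much larger.

```python
import json

def describe(node, depth=0, max_depth=3):
    """Return indented lines listing the keys of nested dicts and lists."""
    lines = []
    if depth > max_depth:
        return lines
    pad = "  " * depth
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            lines.extend(describe(value, depth + 1, max_depth))
    elif isinstance(node, list) and node:
        # Only look inside the first element to keep the output short.
        lines.append(f"{pad}[0]: {type(node[0]).__name__}")
        lines.extend(describe(node[0], depth + 1, max_depth))
    return lines

# Hypothetical sample mirroring the path used above.
sample = json.loads(
    '{"updateDivLists": {"full": [{"divToUpdate": "companyCompliancePolicies_feature_div"}]}}'
)
outline = describe(sample)
print("\n".join(outline))
```

Each printed line is one "box", indented under the box that contains it, so the chain of keys and indexes needed to reach a value can be read off from top to bottom.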

edited Jan 6, 2019 at 0:34

answered Jan 6, 2019 at 0:13

Daniel Scott


2 Comments

Manuel Over a year ago

Thank you so much for the help! I'm going to dig into the JSON file a bit to find what I want. Again, thank you!

Manuel Over a year ago

Daniel, how are you? Is there a way to get, for example, the 'name' of a box? For example, will I get B01KWIUH5M if I use something along the lines of d['asinToDimensionIndexMap'][0]? I ask because, when the program runs, I don't know each product code in advance. Or is the correct way to get the product codes from somewhere else first and then use them, for example by getting every code from d['dimensionToAsinMap']?

ThunderMind · Accepted Answer · 2019-01-07 18:15:51Z

1

import re

# `script` is assumed to be the list of <script> texts, e.g. from .extract()
text = ' '.join(script)
variationValues = re.findall(r'variationValues\" : ({.*?})', text)[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', text)[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', text)[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', text)[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', text)[0]

Each of these is still a string. Now you can easily parse them with json.loads and combine them as you wish.
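One possible way to do the combining, sketched on a tiny stand-in for the page text. Here page_text is illustrative sample data, "EXAMPLEASIN2" is a made-up product code, and the assumption that index positions follow the key order of variationValues (size first, then color) comes from the question's description rather than from any documented guarantee.

```python
import json
import re

# Illustrative stand-in for the joined <script> text; real pages are far larger.
page_text = (
    '"variationValues" : {"size_name":["8 M US","8.5 M US"],"color_name":["Teal"]},\n'
    '"asinToDimensionIndexMap" : {"B01KWIUH5M":[0,0],"EXAMPLEASIN2":[1,0]},\n'
)

# Same patterns as above: capture the raw JSON text after each key.
variation_raw = re.findall(r'variationValues" : ({.*?})', page_text)[0]
index_raw = re.findall(r'asinToDimensionIndexMap" : ({.*})', page_text)[0]

# Parse the captured strings into Python dicts.
variation_values = json.loads(variation_raw)
index_map = json.loads(index_raw)

# Assumption: the n-th index of each entry refers to the n-th key
# of variationValues (size_name first, then color_name).
dimensions = list(variation_values)

variants = {}
for asin, indexes in index_map.items():
    variants[asin] = {
        dim: variation_values[dim][i] for dim, i in zip(dimensions, indexes)
    }

print(variants["B01KWIUH5M"])
```

Iterating over index_map.items() yields every product code together with its indexes, so the codes do not need to be known before the program runs.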

answered Jan 7, 2019 at 18:15

ThunderMind


2 Comments

Manuel Over a year ago

That's just what i needed! Thank you so much!

ThunderMind Over a year ago

Always a pleasure :)
