I want to scrape some data from an Amazon page and I'm kind of stuck.

For example, let's take this page:

https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1

I want to scrape every variant of shoe size and color. That data can be found by opening the page source and searching for 'variationValues'.

[Image: page source showing variationValues and asinToDimensionIndexMap]

There we can see a sort of dictionary containing all the sizes and colors and, below that, in 'asinToDimensionIndexMap', every product code with numbers indicating its variant in the variationValues 'dictionary'.

For example, in asinToDimensionIndexMap we can see:

"B01KWIUH5M":[0,0]

This means that the product code B01KWIUH5M is associated with the size '8 M US' (position 0 in the size_name section of variationValues) and the color 'Teal' (position 0 in the color_name section).

I want to scrape both variationValues and asinToDimensionIndexMap, so I can match the index numbers in asinToDimensionIndexMap to the entries in variationValues.

Another person on the site (thanks for the help, btw) suggested doing it this way:

import json
import re

script = response.xpath('//script/text()').extract_first()
# capture everything between {} (non-greedy, so it stops at the first closing brace)
data = re.findall(r'(\{.+?\})', script)

d = json.loads(data[0])
d['products'][0]

I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of JSON is not that great, and reading about it didn't help much.

Is there a way to get, from that data, two dictionaries or lists with variationValues and asinToDimensionIndexMap (maybe using some regular expressions in the middle to pull the data out of a big string)? Or could someone explain a little what happens in the JSON part?

Thanks for the help!

EDIT: Added photo of variationValues and asinToDimensionIndexMap

2
  • Running d['products'][0] doesn't work on the Amazon site. You'll need to look more closely at the specific structure for your case. I've posted an answer below where you can see a visual representation of the Amazon page's JSON structure. Commented Jan 6, 2019 at 0:41
  • "variationValues" : {"size_name":["8 M US","8.5 M US","9.5 M US","10 M US","10.5 M US"],"color_name":["Teal"]} "dimensionValuesData" : [["8 M US","8.5 M US","9.5 M US","10 M US","10.5 M US"],["Teal"]] What do you actually need from the page? Commented Jan 7, 2019 at 17:58

2 Answers

1

I think you are close, Manuel!

The following code will turn your scraped source into easy-to-select boxes:

import json
d = json.loads(data[0])

JSON is a text format for representing structured data that is independent of platform and language. In Python, json.loads parses a JSON string into ordinary objects (dicts, lists, strings, numbers), so you can index into the result instead of searching the raw string.

https://www.w3schools.com/js/js_json_intro.asp
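As a minimal sketch of what json.loads does, here is a small hand-written JSON string (a shortened version of the variationValues data from the question, not the real page source) parsed into a Python dict:

```python
import json

# A shortened, hand-written sample; the real page contains far more data.
raw = '{"variationValues": {"size_name": ["8 M US", "8.5 M US"], "color_name": ["Teal"]}}'

d = json.loads(raw)  # JSON object -> dict, JSON array -> list

first_size = d["variationValues"]["size_name"][0]
print(type(d).__name__)  # dict
print(first_size)        # 8 M US
```

Each pair of square brackets steps one "box" deeper: a string key selects from a dict, an integer index selects from a list.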

I'm assuming the challenge is the errors you get when accessing a particular "box" inside your JSON object.

Your code looks correct, but the path to each "box" may be different on your page.

E.g., if your 'asinToDimensionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):

d['products'][0]['asinToDimensionIndexMap']

I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side, you will see which boxes are nested within one another, which is precisely what you need to know to access what you need.

JSON Object Viewer

For example, the following would yield "companyCompliancePolicies_feature_div":

import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']

The person helping you before outlined a general case, but you'll need to go in and look at the structure this way to find what you're looking for.
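If you'd rather inspect the nesting from Python than from a viewer, a rough sketch like the following can list the access path to every leaf value. The helper name iter_paths is hypothetical, and the sample dict only mimics the one path shown above; it is not the real page data.

```python
def iter_paths(node, prefix="d"):
    """Yield (access_path, value) pairs for every leaf in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_paths(value, f"{prefix}[{key!r}]")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from iter_paths(value, f"{prefix}[{i}]")
    else:
        yield prefix, node


# Hand-built sample reproducing only the path from the example above.
d = {"updateDivLists": {"full": [{"divToUpdate": "companyCompliancePolicies_feature_div"}]}}

paths = list(iter_paths(d))
for path, value in paths:
    print(path, "=", value)
```

The printed path can be pasted straight back into your code as the expression that reaches that value.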

2 Comments

Thank you so much for the help! I'm going to investigate the JSON file a bit to find what I want. Again, thank you!
Daniel, how are you? Is there a way to get, for example, the 'name' of a box? For example, will I get B01KWIUH5M if I use something along the lines of d['asinToDimensionIndexMap'][0]? When the program runs, I don't know each product code in advance. Or is the correct way to get the product codes from somewhere else first, for example from d['dimensionToAsinMap'], and then use them?
1
import re

# Assumes script is a list of <script> text strings, e.g. response.xpath('//script/text()').extract()
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]

Now you can easily parse them with json.loads and combine them as you wish.
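One possible way to combine them, sketched with the sample values quoted in the question instead of a live page: parse both strings, then translate each index list into names. It assumes the positions in each index list follow the key order of variationValues (size first, then color), which matches the [0,0] example in the question.

```python
import json

# Sample strings shaped like the regex results (values taken from the question).
variation_values_raw = (
    '{"size_name":["8 M US","8.5 M US","9.5 M US","10 M US","10.5 M US"],'
    '"color_name":["Teal"]}'
)
index_map_raw = '{"B01KWIUH5M":[0,0]}'

variation_values = json.loads(variation_values_raw)
index_map = json.loads(index_map_raw)

# Assumption: index positions follow the key order of variationValues.
dimension_names = list(variation_values)

variants = {
    asin: {
        name: variation_values[name][i]
        for name, i in zip(dimension_names, indexes)
    }
    for asin, indexes in index_map.items()
}
print(variants)
# {'B01KWIUH5M': {'size_name': '8 M US', 'color_name': 'Teal'}}
```

Iterating over index_map.items() yields every product code along with its indexes, so the codes do not need to be known in advance.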

2 Comments

That's just what I needed! Thank you so much!
Always a pleasure :)
