
I've pieced together a script which scrapes various pages of products on a product search page and collects the title/price/link to the full description of each product. It was developed using a loop, incrementing the page number in the URL (www.example.com/search/laptops?page=(1+i)) until a request no longer returned a 200 status.

The product title contains the link to the product's full description - I would now like to "visit" that link and do the main data scrape from within the full description of the product.

I have an array built from the links extracted from the product search page - I'm guessing running off this would be a good starting point.

How would I go about extracting the HTML from the links within the array (i.e. visit the individual product page and take the actual product data, not just the summary from the product search page)?
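Roughly, the loop I'm imagining looks like this (just a sketch - the base URL and hrefs below are made-up placeholders, not real data):

```python
from urllib.parse import urljoin

BASE = "https://www.example.com"  # placeholder for the site's root

# Hypothetical hrefs as they might appear in the search-results HTML.
hrefs = ["/laptop/product1", "/laptop/product2"]

# Build absolute URLs once, up front.
product_links = [urljoin(BASE, h) for h in hrefs]

# Then the main scrape would be the same fetch/parse pair, per link:
# for url in product_links:
#     soup = BeautifulSoup(requests.get(url).text, 'html.parser')
#     ...pull the detailed product data out of soup...
```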

Here are the current results I'm getting in CSV format:

 Link                                Title                 Price
 example.com/laptop/product1        laptop                 £400
 example.com/laptop/product2        laptop                 £400
 example.com/laptop/product3        laptop                 £400
 example.com/laptop/product4        laptop                 £400
 example.com/laptop/product5        laptop                 £400
3 Comments

  • can you share your url? Commented Sep 20, 2019 at 13:33
  • guntrader.uk/dealers/street/ivythorn-sporting/guns - that's the page I've scraped the example data from (I was using laptops as a generic example), but it's the links to the full gun data I'm looking to scrape. Commented Sep 20, 2019 at 13:46
  • Just an FYI - wget can recursively scrape pages out-of-the-box - it might help to make sure it's okay with the website's owner. Commented Sep 20, 2019 at 13:47

2 Answers


First, collect all the page links. Then iterate over that list and pull whatever info you need from the individual pages. I've only retrieved the specification values here; you can extract whichever values you want.

from bs4 import BeautifulSoup
import requests

all_links = []
url = "https://www.guntrader.uk/dealers/street/ivythorn-sporting/guns?page={}"

# Collect the product links from each search-results page.
for page in range(1, 3):
    res = requests.get(url.format(page)).text
    soup = BeautifulSoup(res, 'html.parser')
    for link in soup.select('a[href*="/dealers/street"]'):
        all_links.append("https://www.guntrader.uk" + link['href'])

print(len(all_links))

# Visit each product page and print the specifications block, if present.
for a_link in all_links:
    res = requests.get(a_link).text
    soup = BeautifulSoup(res, 'html.parser')
    if soup.select_one('div.gunDetails'):
        print(soup.select_one('div.gunDetails').text)

The output from each page looks like this:

Specifications

Make:Schultz & Larsen
Model:VICTORY GRADE 2 SPIRAL-FLUTED
Licence:Firearm
Orient.:Right Handed
Barrel:23"
Stock:14"
Weight:7lb.6oz.
Origin:Other
Circa:2017
Cased:Makers-Plastic
Serial #:DK-V11321/P20119
Stock #:190912/002
Condition:Used



Specifications

Make:Howa
Model:1500 MINI ACTION [ 1-7'' ] MDT ORYX CHASSIS
Licence:Firearm
Orient.:Right Handed
Barrel:16"
Stock:13 ½"
Weight:7lb.15oz.
Origin:Other
Circa:2019
Cased:Makers-Plastic
Serial #:B550411
Stock #:190905/002
Condition:New



Specifications

Make:Weihrauch
Model:HW 35
Licence:No Licence
Orient.:Right Handed
Scope:Simmons 3-9x40
Total weight:9lb.3oz.
Origin:German
Circa:1979
Serial #:746753
Stock #:190906/004
Condition:Used
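One small tweak worth considering: a selector like `a[href*="/dealers/street"]` can match the same product page more than once (e.g. if both the image and the title link to it), so deduplicating `all_links` before the second loop avoids re-fetching pages. A sketch, with made-up links standing in for the scraped ones:

```python
# Hypothetical scraped links, with a duplicate as can happen when a
# product's image and title both link to the same page.
all_links = [
    "https://www.guntrader.uk/guns/a",
    "https://www.guntrader.uk/guns/b",
    "https://www.guntrader.uk/guns/a",
]

# dict.fromkeys keeps the first occurrence and preserves order (Python 3.7+).
unique_links = list(dict.fromkeys(all_links))
print(len(unique_links))  # → 2
```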

If you want to fetch the title and price from each link, try this:

from bs4 import BeautifulSoup
import requests

all_links = []
url = "https://www.guntrader.uk/dealers/street/ivythorn-sporting/guns?page={}"

# Collect the product links from each search-results page.
for page in range(1, 3):
    res = requests.get(url.format(page)).text
    soup = BeautifulSoup(res, 'html.parser')
    for link in soup.select('a[href*="/dealers/street"]'):
        all_links.append("https://www.guntrader.uk" + link['href'])

print(len(all_links))

# Visit each product page and print its title and price, guarding against
# pages where either element is missing.
for a_link in all_links:
    res = requests.get(a_link).text
    soup = BeautifulSoup(res, 'html.parser')
    title = soup.select_one('h1[itemprop="name"]')
    price = soup.select_one('p.price')
    if title and price:
        print("Title:" + title.text)
        print("Price:" + price.text)
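If you then want to push those fields into a CSV (as in the question's sample output), Python's standard `csv` module is enough. A sketch, with the rows hard-coded to stand in for whatever the loop above collects:

```python
import csv

# Stand-in rows; in practice these would come from the scraping loop.
rows = [
    {"Link": "example.com/laptop/product1", "Title": "laptop", "Price": "£400"},
    {"Link": "example.com/laptop/product2", "Title": "laptop", "Price": "£400"},
]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Link", "Title", "Price"])
    writer.writeheader()   # writes the Link,Title,Price header row
    writer.writerows(rows)
```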

7 Comments

Hi KunduK - good man, I'll have a look at this when I get home!
Genuinely wasn't expecting a full "here you go", so thanks! Have you any documentation worth reading? Looking at your code, it's actually very similar to what I had already regarding the various pages of products on the product search page.
I have learned from the official documentation only: crummy.com/software/BeautifulSoup/bs4/doc
@AndrewGlass: Sure. Let me know the status at your end once you run the code.
Yeah, works a treat - remembered I had the laptop in the car so ran down and pushed it out. 100% - a few changes needed, as in push the data to a CSV and add a few things, but running as you stepped out and as I asked, so bingo - good man yourself!

Just extract the part of the string which is a URL from the product title, then do:

import requests
res = requests.get(<url-extracted-above->)
res.content

then, using the package beautifulsoup, do:

from bs4 import BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')

and keep iterating, treating this HTML as a tree you can navigate. You may refer to this easy-to-follow link on requests and beautifulsoup: https://www.dataquest.io/blog/web-scraping-tutorial-python/
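To give a feel for what "taking this HTML as a tree" looks like, here is a tiny self-contained example; the markup below is invented for illustration, not taken from the site in question:

```python
from bs4 import BeautifulSoup

# Invented markup mimicking a product-detail page.
html = """
<div class="gunDetails">
  <h1 itemprop="name">Example Rifle</h1>
  <p class="price">£400</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors walk the parsed tree to the elements we want.
title = soup.select_one('h1[itemprop="name"]').get_text(strip=True)
price = soup.select_one("p.price").get_text(strip=True)
print(title, price)  # → Example Rifle £400
```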

Hope this helps? Not sure if I got your question correct, but anything here can be done with the urllib2 / requests / beautifulsoup / json / xml Python libraries when it comes to web scraping / parsing.

1 Comment

Need to do some more reading on it, and doing a few tests and playing about, but might have got the exact answer above - obviously going to go through it and make sure I understand what's going on, referring to the documentation provided. Cheers!
