
I've pieced together a script which scrapes various pages of products on a product search page and collects the title/price/link to the full description of each product. It was developed using a loop, incrementing the page number in the URL (www.example.com/search/laptops?page=(1+i)) until a request no longer returned a 200 status.

The product title contains the link to the product's full description - I would now like to "visit" that link and do the main data scrape from within the full description of the product.

I have an array built from the links extracted from the product search page - I'm guessing running off this would be a good starting point.

How would I go about extracting the HTML from the links within the array (i.e. visit the individual product page and take the actual product data, not just the summary from the product search page)?
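Roughly, the loop I'm imagining looks like this (just a sketch - the base URL and hrefs below are made-up placeholders, not real data):

```python
from urllib.parse import urljoin

BASE = "https://www.example.com"  # placeholder for the site's root

# Hypothetical hrefs as they might appear in the search-results HTML.
hrefs = ["/laptop/product1", "/laptop/product2"]

# Build absolute URLs once, up front.
product_links = [urljoin(BASE, h) for h in hrefs]

# Then the main scrape would be the same fetch/parse pair, per link:
# for url in product_links:
#     soup = BeautifulSoup(requests.get(url).text, 'html.parser')
#     ...pull the detailed product data out of soup...
```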

Here are the current results I'm getting in CSV format:

 Link                                Title                 Price
 example.com/laptop/product1        laptop                 £400
 example.com/laptop/product2        laptop                 £400
 example.com/laptop/product3        laptop                 £400
 example.com/laptop/product4        laptop                 £400
 example.com/laptop/product5        laptop                 £400
3 Comments

  • can you share your url? Commented Sep 20, 2019 at 13:33
  • guntrader.uk/dealers/street/ivythorn-sporting/guns - that's the page I've scraped the example data from (I was using laptops as a generic example), but it's the links to the full gun data I'm looking to scrape. Commented Sep 20, 2019 at 13:46
  • Just an FYI - wget can recursively scrape pages out-of-the-box - it might help to make sure it's okay with the website's owner. Commented Sep 20, 2019 at 13:47

2 Answers


First, collect all the page links. Then iterate over that list and pull whatever info you need from the individual pages. I've only retrieved the specification values here; you can extract whichever values you want.

from bs4 import BeautifulSoup
import requests

all_links = []
url = "https://www.guntrader.uk/dealers/street/ivythorn-sporting/guns?page={}"

# Collect the product links from each search-results page.
for page in range(1, 3):
    res = requests.get(url.format(page)).text
    soup = BeautifulSoup(res, 'html.parser')
    for link in soup.select('a[href*="/dealers/street"]'):
        all_links.append("https://www.guntrader.uk" + link['href'])

print(len(all_links))

# Visit each product page and print the specifications block, if present.
for a_link in all_links:
    res = requests.get(a_link).text
    soup = BeautifulSoup(res, 'html.parser')
    if soup.select_one('div.gunDetails'):
        print(soup.select_one('div.gunDetails').text)

The output from each page looks like this:

Specifications

Make:Schultz & Larsen
Model:VICTORY GRADE 2 SPIRAL-FLUTED
Licence:Firearm
Orient.:Right Handed
Barrel:23"
Stock:14"
Weight:7lb.6oz.
Origin:Other
Circa:2017
Cased:Makers-Plastic
Serial #:DK-V11321/P20119
Stock #:190912/002
Condition:Used



Specifications

Make:Howa
Model:1500 MINI ACTION [ 1-7'' ] MDT ORYX CHASSIS
Licence:Firearm
Orient.:Right Handed
Barrel:16"
Stock:13 ½"
Weight:7lb.15oz.
Origin:Other
Circa:2019
Cased:Makers-Plastic
Serial #:B550411
Stock #:190905/002
Condition:New



Specifications

Make:Weihrauch
Model:HW 35
Licence:No Licence
Orient.:Right Handed
Scope:Simmons 3-9x40
Total weight:9lb.3oz.
Origin:German
Circa:1979
Serial #:746753
Stock #:190906/004
Condition:Used
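One small tweak worth considering: a selector like `a[href*="/dealers/street"]` can match the same product page more than once (e.g. if both the image and the title link to it), so deduplicating `all_links` before the second loop avoids re-fetching pages. A sketch, with made-up links standing in for the scraped ones:

```python
# Hypothetical scraped links, with a duplicate as can happen when a
# product's image and title both link to the same page.
all_links = [
    "https://www.guntrader.uk/guns/a",
    "https://www.guntrader.uk/guns/b",
    "https://www.guntrader.uk/guns/a",
]

# dict.fromkeys keeps the first occurrence and preserves order (Python 3.7+).
unique_links = list(dict.fromkeys(all_links))
print(len(unique_links))  # → 2
```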

If you want to fetch the title and price from each link, try this:

from bs4 import BeautifulSoup
import requests

all_links = []
url = "https://www.guntrader.uk/dealers/street/ivythorn-sporting/guns?page={}"

# Collect the product links from each search-results page.
for page in range(1, 3):
    res = requests.get(url.format(page)).text
    soup = BeautifulSoup(res, 'html.parser')
    for link in soup.select('a[href*="/dealers/street"]'):
        all_links.append("https://www.guntrader.uk" + link['href'])

print(len(all_links))

# Visit each product page and print its title and price, guarding against
# pages where either element is missing.
for a_link in all_links:
    res = requests.get(a_link).text
    soup = BeautifulSoup(res, 'html.parser')
    title = soup.select_one('h1[itemprop="name"]')
    price = soup.select_one('p.price')
    if title and price:
        print("Title:" + title.text)
        print("Price:" + price.text)
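If you then want to push those fields into a CSV (as in the question's sample output), Python's standard `csv` module is enough. A sketch, with the rows hard-coded to stand in for whatever the loop above collects:

```python
import csv

# Stand-in rows; in practice these would come from the scraping loop.
rows = [
    {"Link": "example.com/laptop/product1", "Title": "laptop", "Price": "£400"},
    {"Link": "example.com/laptop/product2", "Title": "laptop", "Price": "£400"},
]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Link", "Title", "Price"])
    writer.writeheader()   # writes the Link,Title,Price header row
    writer.writerows(rows)
```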

7 Comments

Hi KunduK - good man, I'll have a look at this when I get home!
Genuinely wasn't expecting a full "here you go", so thanks! Have you any documentation worth reading? Looking at your code, it's actually very similar to what I had already regarding the various pages of products on the product search page.
I have learned from the official documentation only: crummy.com/software/BeautifulSoup/bs4/doc
@AndrewGlass: Sure. Let me know the status at your end once you run the code.
Yeah, works a treat - remembered I had the laptop in the car so ran down and pushed it out. 100% - a few changes needed, as in push the data to a CSV and add a few things, but running as you stepped out and as I asked, so bingo - good man yourself!

Just extract the part of the string which is a URL from the product title, then do:

import requests
res = requests.get(<url-extracted-above->)
res.content

then, using the package beautifulsoup, do:

from bs4 import BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')

and keep iterating, treating this HTML as a tree you can navigate. You may refer to this easy-to-follow link on requests and beautifulsoup: https://www.dataquest.io/blog/web-scraping-tutorial-python/
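To give a feel for what "taking this HTML as a tree" looks like, here is a tiny self-contained example; the markup below is invented for illustration, not taken from the site in question:

```python
from bs4 import BeautifulSoup

# Invented markup mimicking a product-detail page.
html = """
<div class="gunDetails">
  <h1 itemprop="name">Example Rifle</h1>
  <p class="price">£400</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors walk the parsed tree to the elements we want.
title = soup.select_one('h1[itemprop="name"]').get_text(strip=True)
price = soup.select_one("p.price").get_text(strip=True)
print(title, price)  # → Example Rifle £400
```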

Hope this helps? Not sure if I got your question correct, but anything here can be done with the urllib2 / requests / beautifulsoup / json / xml Python libraries when it comes to web scraping / parsing.

1 Comment

Need to do some more reading on it, and doing a few tests and playing about, but might have got the exact answer above - obviously going to go through it and make sure I understand what's going on, referring to the documentation provided. Cheers!
