
I have spent the better part of my day researching and testing the best way to loop through a set of products on a retailer's website.

While I am successfully able to collect the set of products (and attributes) on the first page, I have been stumped figuring out the best way to loop through the pages of the site to continue my scrape.

Per my code below, I have attempted to use a 'while' loop and Selenium to click on the 'next page' button of the website and then continue to collect products.

The issue is that my code still doesn't get past page 1.

Am I making a silly error here? I've read four or five similar examples on this site, but none were specific enough to solve the problem here.

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products.clear()
hyperlinks.clear()
reviewCounts.clear()
starRatings.clear()

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1


html_soup = BeautifulSoup(driver.page_source, 'html.parser')
prod_containers = html_soup.find_all('li', class_ = 'products_grid')


while (pageCounter < maxPageCount):
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    counterProduct +=1
    print(counterProduct)
  • I’ll have to test and play with this later because I'm not near a computer, but the first thing I notice is that your html_soup and prod_containers aren't inside the loop. You parse and iterate the first page, but don't do anything after that first page. Once you've iterated through a page and clicked to the next one, you need to parse the HTML and run find_all on products_grid again. So I’d move the whole while statement right before your html_soup line. Commented Dec 28, 2018 at 23:29
  • I also think you meant ‘pageCounter += 1’, not ‘counterProduct’? Commented Dec 28, 2018 at 23:34
  • Sorry for typos. Move while statement before html_soup Commented Dec 28, 2018 at 23:36

2 Answers


You need to parse each time you "click" to the next page, so that parsing step needs to be included within your while loop; otherwise you just keep iterating over the 1st page, even after clicking to the next one, because the prod_containers object never changes.

Secondly, the way you have it, your while loop will never stop, because you set pageCounter = 0 but never increment it, so it will forever be less than your maxPageCount.
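As a stripped-down illustration of that second bug (toy values, not the scraper itself): the loop condition tests pageCounter, so pageCounter is the name that must be incremented; bumping a differently named counter such as counterProduct leaves the condition true forever.

```python
# Minimal illustration of the infinite-loop bug: the loop condition tests
# pageCounter, so pageCounter is the variable that must be incremented.
pageCounter = 0
maxPageCount = 3
pagesVisited = []

while pageCounter < maxPageCount:
    pagesVisited.append(pageCounter)
    pageCounter += 1  # incrementing any other name here would never end the loop

print(pagesVisited)  # [0, 1, 2]
```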

I fixed those two things in the code and ran it; it appears to have worked and parsed pages 1 through 5.

from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0

html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1

prod_containers = html_soup.find_all('li', class_ = 'products_grid')


while (pageCounter < maxPageCount):
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            name = name.strip()
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    pageCounter +=1
    print(pageCounter)

1 Comment

Thank you @chitown88, I ran the code and found that in some cases it skipped over products in the loop. I applied sleep logic and slightly tweaked my max-page logic, and it is all working swimmingly now. I much appreciate the second pair of eyes on this!
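The "sleep logic" mentioned above can be as crude as a fixed time.sleep() after each click, but a polling wait is usually more reliable. Below is a sketch of a generic wait helper in the spirit of Selenium's WebDriverWait; the helper name, timings, and the page_ready stand-in are my own illustration, not code from the answer.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Mirrors the idea behind Selenium's WebDriverWait: instead of a fixed
    time.sleep(), re-check cheaply until the page is actually ready.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError(f"condition not met within {timeout:.1f}s")

# Example usage with a plain callable standing in for a page-ready check:
counter = {"calls": 0}

def page_ready():
    counter["calls"] += 1
    return counter["calls"] >= 3  # truthy on the third poll

ready = wait_until(page_ready, timeout=5.0, poll=0.01)
```

In the scraper, the condition would be a callable that checks driver.page_source (or a known element) for content from the new page before re-parsing it with BeautifulSoup.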

OK, this snippet of code will not run on its own from a .py file; I'm guessing you were running it in IPython or a similar environment and already had these variables initialized and the libraries imported.

First off, you need to import the regex module:

import re

Also, all those clear() calls are unnecessary, since you initialize those lists anyway (in fact, Python raises a NameError here, because the lists haven't been defined yet when you call clear() on them).
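A minimal demonstration of that NameError, separate from the scraper (the message variable is just for illustration):

```python
# Calling .clear() on a name that was never bound raises NameError, which is
# why the clear() lines must come after (or simply be replaced by) the
# list initializations.
try:
    products.clear()  # 'products' does not exist yet in a fresh interpreter
except NameError as exc:
    message = str(exc)

products = []  # the plain initialization both creates and empties the list
```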

You also need to initialize counterProduct:

counterProduct = 0

and finally, you have to assign a value to html_soup before referencing it in your code:

html_soup = BeautifulSoup(driver.page_source, 'html.parser')

Here is the corrected code, which works:

from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1
prod_containers = html_soup.find_all('li', class_ = 'products_grid')
counterProduct = 0
while (pageCounter < maxPageCount):
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    counterProduct +=1
    print(counterProduct)
