
I have been developing in JavaScript for a while, but Python still feels fairly new to me. I'm trying to scrape the content of a simple webpage with Python (basically a product list split into several sections). The content is dynamically generated, so I'm using the selenium module for this.

The content is structured like this, with several product sections:

<div class="product-section">
    <div class="section-title">
        Product section name
    </div>
    <ul class="products">
        <li class="product">
            <div class="name">Wooden Table</div>
            <div class="price">99 USD</div>
            <div class="color">White</div>
        </li>
    </ul>
</div>

Python code for scraping the products:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://website.com')
names = driver.find_elements_by_css_selector('div.name')
prices = driver.find_elements_by_css_selector('div.price')
colors = driver.find_elements_by_css_selector('div.color')

allNames = [name.text for name in names]
allPrices = [price.text for price in prices]
allColors = [color.text for color in colors]

Right now I get the attributes of all products (see below), but I can't separate them by section.

Current outcome:
Wooden Table, 99 USD, White
Lawn Chair, 39 USD, Black
Tent - 4 Person, 299 USD, Camo
etc.

Desired outcome:
Outdoor Furniture
Wooden Table, 99 USD, White
Lawn Chair, 39 USD, Black

Camping Gear
Tent - 4 Person, 299 USD, Camo
Thermos, 19 USD, Metallic

The end goal is to output the contents into an Excel product list, which is why I need to keep the sections separate (each with its matching section title). Any idea how to keep them apart, even though they share the same class names?

  • Suggest you look at the Beautiful Soup library at crummy.com/software/BeautifulSoup/bs4/doc Commented Apr 14, 2018 at 0:58
  • It seems like it has the functions I would need, thank you! Commented Apr 14, 2018 at 1:02
  • BeautifulSoup is a very powerful library, but it might be overkill for simpler tasks - another API to learn. Vanilla Selenium scraping is quite up to a task like this one. Commented Apr 14, 2018 at 10:00
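For reference, here is a minimal sketch of the Beautiful Soup route suggested in the comments above, parsing the page source that Selenium has already rendered. This is only an illustration: it assumes bs4 is installed and reuses the class names from the sample HTML in the question.

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')

for section in soup.select('div.product-section'):
    # each section carries its own title and its own product list
    title = section.select_one('div.section-title').get_text(strip=True)
    print(title)
    for product in section.select('li.product'):
        name = product.select_one('div.name').get_text(strip=True)
        price = product.select_one('div.price').get_text(strip=True)
        color = product.select_one('div.color').get_text(strip=True)
        print(name, price, color)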

1 Answer


You're almost there: to group the products by section, start from each section element and locate all the product elements within it. At least, your sample HTML implies its structure allows that.

Based on your code, here's a solution with explanatory comments.

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://website.com')

# a dict where the key will be the section name
products = {}

# find all top-level sections
sections = driver.find_elements_by_css_selector('div.product-section')

# iterate over each one
for section in sections:
    # find the products that are children of this section
    # note the find is called on section, not driver
    names = section.find_elements_by_css_selector('div.name')
    prices = section.find_elements_by_css_selector('div.price')
    colors = section.find_elements_by_css_selector('div.color')

    allNames = [name.text for name in names]
    allPrices = [price.text for price in prices]
    allColors = [color.text for color in colors]

    section_name = section.find_element_by_css_selector('div.section-title').text

    # add the current scraped section to the products dict
    # matching each name to its price and color is left to you ;)
    # (a zip-based sketch follows below)

    products[section_name] = {'names': allNames,
                              'prices': allPrices,
                              'colors': allColors,}

# and here's how to access the result

# get the 1st name in a section:
print(products['Product section name']['names'][0])  # will output "Wooden Table"

# iterate over the sections and products:
for section_name, data in products.items():
    print('Section: {}'.format(section_name))
    print('All prices in the section:')
    for price in data['prices']:
        print(price)
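To match each product's name, price and color, and to get the result into the Excel list mentioned in the question, one option is to zip the three per-section lists and write the rows out with openpyxl. This is only a sketch: it assumes the three lists stay aligned (they are scraped in document order), and openpyxl is my suggestion, not a library named anywhere in this thread.

from openpyxl import Workbook

wb = Workbook()
ws = wb.active

for section_name, data in products.items():
    # write the section title on its own row
    ws.append([section_name])
    # zip keeps the i-th name, price and color together,
    # relying on the three lists sharing the same document order
    for name, price, color in zip(data['names'], data['prices'], data['colors']):
        ws.append([name, price, color])
    # blank row between sections
    ws.append([])

wb.save('products.xlsx')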

1 Comment

Thank you so much! This is the exact structure I had in mind but did not know how to go about it.
