How to scrape elements in Selenium/Python by calling different css selectors at the same time?

Question

I am trying to select the title of posts that are loaded in a webpage by integrating multiple css selectors. See below my process:

Load relevant libraries

import time
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Then load the content I wish to analyse

options = Options()
options.set_preference("dom.push.enabled", False)
browser = webdriver.Firefox(options=options)

browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[@type='search']").send_keys("international development",Keys.ENTER)
time.sleep(5)

scrolls = 2
while True:
    scrolls -= 1
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)
    if scrolls < 0:
        break

Then to get the content for each selector separately, call for css_selector

titles=browser.find_elements_by_css_selector("h3[class^='graf']")
TitlesList = []
for names in titles:
    names.text
    TitlesList.append(names.text) 

times=browser.find_elements_by_css_selector("time[datetime^='2016']")
Times = []
for names in times:
    names.text
    Times.append(names.text)

It all works so far. Now I am trying to combine them in order to identify only the entries from 2016:

choices = browser.find_elements_by_css_selector("time[datetime^='2016'] and h3[class^='graf']")    
browser.quit()

On this last snippet, I always get an empty list.

So I wonder: 1) how can I select elements by applying several CSS selectors as conditions at the same time; 2) would the syntax for combining conditions be the same when mixing different approaches, such as CSS selectors and XPath; and 3) is there a way to get the text of elements identified through multiple CSS selectors, along the lines of the following:

[pair.text for pair in browser.find_elements_by_css_selector("h3[class^='graf']") if pair.text]

Thanks

jackblk · Accepted Answer · 2021-04-08 08:19:36Z


First, I think what you're trying to do is get the title of every post published in 2016, right?

You're using the CSS selector "time[datetime^='2016'] and h3[class^='graf']", but this will not work because the syntax is not valid ("and" is not part of CSS selector syntax). Plus, these are two different elements, and a CSS selector describes a single target element. In your case, to add a condition based on another element, go through a common element such as a shared parent.
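For reference, here is a minimal sketch of the two ways CSS selector syntax does let you combine conditions. A comma gives a union (an element matches if either selector matches), and attribute tests stacked on one element must all hold. The helper names any_of and all_of are hypothetical, not part of Selenium. Neither form can pair an h3 with a time element that sits in a different branch of the page.

```python
def any_of(*selectors):
    """Union: an element matches if it matches ANY of the selectors (CSS comma)."""
    return ", ".join(selectors)


def all_of(tag, *attribute_tests):
    """Compound: ONE element of type `tag` must satisfy ALL attribute tests."""
    return tag + "".join(f"[{test}]" for test in attribute_tests)


# Matches every 2016 <time> and every graf <h3>, returned as one mixed, unpaired list.
either = any_of("time[datetime^='2016']", "h3[class^='graf']")

# Matches only <h3> elements whose class starts with 'graf' AND that carry a name attribute.
both = all_of("h3", "class^='graf'", "name")

print(either)
print(both)
```

Either string can be passed to find_elements_by_css_selector as usual; the union returns times and titles together, so it still does not tell you which title belongs to which date.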

I've checked the site; here's the HTML that you need to look at (if you're trying to get the titles of posts published in 2016). This is the minimal part of the HTML that helps you identify what you need to get.

<div class="postArticle postArticle--short js-postArticle js-trackPostPresentation" data-post-id="d17220aecaa8"
    data-source="search_post---------2">
    <div class="u-clearfix u-marginBottom15 u-paddingTop5">
        <div class="postMetaInline u-floatLeft u-sm-maxWidthFullWidth">
            <div class="u-flexCenter">
                <div class="postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis">
                    <div
                        class="ui-caption u-fontSize12 u-baseColor--textNormal u-textColorNormal js-postMetaInlineSupplemental">
                        <a class="link link--darken"
                            href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
                            data-action="open-post"
                            data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
                            data-action-source="preview-listing">
                            <time datetime="2016-09-05T13:55:05.811Z">Sep 5, 2016</time>
                        </a>
                    </div>
                </div>
            </div>
        </div>
    </div>
    <div class="postArticle-content">
        <a href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
            data-action="open-post" data-action-source="search_post---------2"
            data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
            data-action-index="2" data-post-id="d17220aecaa8">
            <section class="section section--body section--first section--last">
                <div class="section-divider">
                    <hr class="section-divider">
                </div>
                <div class="section-content">
                    <div class="section-inner sectionLayout--insetColumn">
                        <h3 name="5910" id="5910" class="graf graf--h3 graf--leading graf--title">Reimagining
                            International Development for the 21st&nbsp;Century.</h3>
                    </div>
                </div>
            </section>
        </a>
    </div>
</div>

Both time and h3 are inside a big div with the class postArticle. The article contains both the publication time and the title, so it makes sense to get the whole article div for posts published in 2016, right?

Using XPATH is much more powerful & easier to write:

This will get all article divs whose class attribute contains postArticle--short: article_xpath = '//div[contains(@class, "postArticle--short")]'
This will get all time tags whose datetime attribute contains 2016: //time[contains(@datetime, "2016")]

Let's combine both of them. I want the article divs that contain a time tag whose datetime attribute contains 2016:

article_2016_xpath = '//div[contains(@class, "postArticle--short")][.//time[contains(@datetime, "2016")]]'
# "browser" is the webdriver instance created in the question
article_element_list = browser.find_elements_by_xpath(article_2016_xpath)

# now let's get the title of each matching article
titles_2016 = []
for article in article_element_list:
    title = article.find_element_by_tag_name("h3").text
    titles_2016.append(title)

I haven't tested the code yet, only the xpath. You might need to adapt the code to work on your side.
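The find_element_by_* helpers used above were removed in later Selenium 4 releases, so here is a hedged sketch of the same idea written against the generic find_elements(by, value) call. The function names are hypothetical. The locator strategies are spelled as plain strings, which are the values of By.XPATH and By.TAG_NAME, only so that the sketch has no import-time dependency. It also swaps contains for starts-with, which is my substitution, so that 2016 is matched only at the start of the datetime attribute, as the ^= operator in the question does.

```python
XPATH = "xpath"        # same value as selenium.webdriver.common.by.By.XPATH
TAG_NAME = "tag name"  # same value as selenium.webdriver.common.by.By.TAG_NAME


def article_xpath(year):
    """XPath for article <div>s containing a <time> whose datetime starts with `year`."""
    return (
        '//div[contains(@class, "postArticle--short")]'
        f'[.//time[starts-with(@datetime, "{year}")]]'
    )


def titles_for_year(browser, year):
    """Return the non-empty <h3> title text of each matching article."""
    titles = []
    for article in browser.find_elements(XPATH, article_xpath(year)):
        # find_elements returns an empty list instead of raising if there is no <h3>
        headings = article.find_elements(TAG_NAME, "h3")
        if headings and headings[0].text:
            titles.append(headings[0].text)
    return titles
```

With the browser object from the question, titles_for_year(browser, 2016) would return the list of titles; changing the year argument reuses the same expression for other years.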

By the way, calling find_element... directly on a page that may still be loading is not a good idea; try using an explicit wait instead: https://selenium-python.readthedocs.io/waits.html

This lets you avoid fixed time.sleep waits, improves your app's performance, and makes errors easier to handle.

Only use find_element... when you have already located an element and need to find a child element inside it. For example, in this case, if I want to find the articles, I will locate them with an explicit wait, and once they are located, I will use find_element... to find the child h3 element.
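A minimal sketch of that split, assuming Selenium's WebDriverWait and expected_conditions: wait explicitly for the article divs, then use a plain child lookup for the h3. The function names and the 10-second default timeout are illustrative choices, not something from the linked docs.

```python
XPATH = "xpath"        # same value as selenium.webdriver.common.by.By.XPATH
TAG_NAME = "tag name"  # same value as selenium.webdriver.common.by.By.TAG_NAME

ARTICLE_XPATH = '//div[contains(@class, "postArticle--short")]'


def wait_for_articles(browser, timeout=10):
    """Block until at least one article <div> is in the DOM, then return them all.

    Raises selenium.common.exceptions.TimeoutException if none appear in time.
    """
    # Imported here so the sketch can be loaded without Selenium installed.
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(browser, timeout)
    return wait.until(EC.presence_of_all_elements_located((XPATH, ARTICLE_XPATH)))


def first_title(article):
    """Child lookup on an element that has already been located."""
    return article.find_element(TAG_NAME, "h3").text
```

Typical use would be [first_title(a) for a in wait_for_articles(browser)], which replaces the fixed sleep before the first lookup; the scrolling loop in the question would still need its own wait condition.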


4 Comments

Nicola Over a year ago

This is a fine input. I was specifically looking for a CSS way to select the subset of articles from 2016, and I would like to know whether there is an equivalent. You have done it with XPath, and by just appending the results to a list the data is scraped. I have been testing different syntax and combinations of iterations with CSS, though. I am stuck on formulating something along these lines:

article_2016_css = "div[class*='postArticle--short']>time[datetime*='2016']"
article_element_list_css = browser.find_elements_by_css_selector(article_2016_css)

but I still get an empty list.

jackblk Over a year ago

From what I know, there's no way to crawl this type of HTML structure via CSS efficiently. You can use the CSS selector div[class*='postArticle--short']>div>div>div>div>div>a>time[datetime^="2016"] to get the right post, but this returns a time tag, so you need extra steps to reach the article content (getting the parent of the parent of the parent element..., then getting the h3 child). It's doable, but I don't think it's a good way to do it. CSS selectors are very limited. XPath is a bit slower, but it makes it easier to write the right condition for what you need.
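To make those extra steps concrete, here is a hedged sketch: locate the time tags with CSS, then climb from each one to its enclosing article div and down to the h3. The names are hypothetical. The climb uses the XPath ancestor axis because CSS has no way to go upward, and the descendant combinator (a space) stands in for the long chain of >div steps. A single > only matches a direct child, which is why div>time finds nothing here.

```python
CSS = "css selector"  # same value as selenium.webdriver.common.by.By.CSS_SELECTOR
XPATH = "xpath"       # same value as selenium.webdriver.common.by.By.XPATH

# Any <time> from 2016 nested at any depth inside an article <div>.
TIME_CSS = "div[class*='postArticle--short'] time[datetime^='2016']"

# From the <time>: nearest enclosing article <div>, then any <h3> below it.
TITLE_FROM_TIME_XPATH = (
    "./ancestor::div[contains(@class, 'postArticle--short')][1]//h3"
)


def dated_titles(browser):
    """Return (date text, title text) pairs for posts dated 2016."""
    pairs = []
    for time_el in browser.find_elements(CSS, TIME_CSS):
        headings = time_el.find_elements(XPATH, TITLE_FROM_TIME_XPATH)
        if headings:  # skip posts without an <h3> title
            pairs.append((time_el.text, headings[0].text))
    return pairs
```

Because each title is looked up relative to its own time element, the dates and titles stay paired.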

Nicola Over a year ago

I agree with your argument, but I just want to fully understand the logic and limitations of CSS selectors. If I use the selector you shared, I get a list of items, but then I cannot use article.find_element_by_tag_name("h3").text to get the article title. Instead I am trying to call the full path:

for article in article_element_list_css:
    titles = article.find_elements_by_css_selector("div[class*='postArticle--short']>div>a>section>div>div>h3[class*='graf']")

but even then I do not get a matching list between articles and year tags like with the XPATH.

jackblk Over a year ago

You can't do that with a CSS selector (for now). Check here: developer.mozilla.org/en-US/docs/Web/CSS/:has. At the time of writing, no browser supports it.
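As a note beyond the thread: recent releases of the major browsers have since implemented :has(), so a CSS-only version may be possible today. The sketch below only builds the selector string; the helper name is hypothetical, and it assumes that the browser build driven by Selenium is new enough to understand :has().

```python
def titles_css_for_year(year):
    """CSS selector for <h3> titles inside articles that contain a <time> from `year`."""
    return (
        "div[class*='postArticle--short']"
        f":has(time[datetime^='{year}']) "
        "h3[class^='graf']"
    )


selector = titles_css_for_year(2016)
print(selector)

# With a :has()-capable browser, the selector is used like any other, e.g.:
#     titles = [h3.text for h3 in browser.find_elements("css selector", selector) if h3.text]
```

On a browser without :has() support the selector is rejected as invalid, so the XPath version remains the safer default.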
