Python with Selenium scraper skips some content

Question

I'm trying to scrape data from the website https://rsoe-edis.org/eventList and save it to an xlsx file. The scraper doesn't raise any errors, but it skips some content: it saves all the links, but for some rows the other fields come out empty. Why?

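from selenium import webdriver
import xlsxwriter
from datetime import datetime

now = (datetime.now()).strftime("%d-%m-%Y_%H-%M")

# Raw string so the backslashes in the Windows path are not treated as escapes
PATH = r"C:\Program Files (x86)\chromedriver.exe"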
driver = webdriver.Chrome(PATH)

workbook = xlsxwriter.Workbook("RSOE_" + now + ".xlsx")

worksheet = workbook.add_worksheet("EventList") 

#Open the website
driver.get("https://rsoe-edis.org/eventList")

#Take events list
articles = driver.find_elements_by_tag_name("tr")
row = 0
col = 0

for article in articles:
        
        header = article.find_element_by_class_name("title")
        date = article.find_element_by_class_name("eventDate")
        location = article.find_element_by_class_name("location")
        link = article.find_element_by_tag_name("a")  
        worksheet.write(row, col,     header.text)
        worksheet.write(row, col + 1, date.text)
        worksheet.write(row, col + 2, location.text)
        worksheet.write(row, col + 3, link.get_attribute("href"))   

        print(header.text)

        row += 1      
workbook.close()      

driver.close()

tbjorch · Accepted Answer · 2021-03-13 12:43:26Z

Problem explanation

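The problem in your case is that many of the event cards are hidden (they have the style attribute display:none;), and Selenium does not return the text content of hidden elements through the WebElement's .text attribute. For those elements .text is an empty string, which is why the links are saved but the other columns are blank.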

Solution

To read the content of the hidden elements, you could, among other options:

fetch the WebElement's attribute values instead (e.g. .get_attribute("innerText"))
use raw JavaScript to unhide the elements and then continue with .text
use raw JavaScript to fetch all the elements

Example getting the element text content using `.get_attribute()`

Here I use the WebElement's .get_attribute() method to get the content via the innerText attribute, then the string .strip() method to remove leading and trailing whitespace.

driver.get("https://rsoe-edis.org/eventList")
articles = driver.find_elements_by_tag_name("tr")
with open("my_articles.csv", "wt") as f:
    for article in articles:
        header = article.find_element_by_class_name("title").get_attribute("innerText").strip()
        date = article.find_element_by_class_name("eventDate").get_attribute("innerText").strip()
        location = article.find_element_by_class_name("location").get_attribute("innerText").strip()
        link = article.find_element_by_tag_name("a").get_attribute("href")
        f.write(f"{header}, {date}, {location}, {link}\n")
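One caveat with writing the line by hand: if a title or location itself contains a comma, the columns in the file will shift. A minimal sketch of the same write step using Python's standard csv module, which quotes such fields for you. The function name write_rows and the sample row are made up for illustration; in the real script the rows would come from the Selenium loop above.

```python
import csv
import io


def write_rows(rows, f):
    """Write (header, date, location, link) tuples as CSV, quoting fields when needed."""
    writer = csv.writer(f)
    writer.writerow(["header", "date", "location", "link"])
    writer.writerows(rows)


# Hypothetical sample row, for illustration only (not real data from the site).
sample_rows = [
    ("Flood, severe", "2021-03-12", "Sydney, Australia", "https://example.org/event/1"),
]

# An in-memory buffer stands in for the output file here.
buffer = io.StringIO()
write_rows(sample_rows, buffer)
csv_text = buffer.getvalue()
```

When writing to a real file, open it with open("my_articles.csv", "w", newline="") as the csv module documentation recommends, and pass that file object instead of the buffer.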

Example unhiding elements with raw JavaScript enabling `.text`

Below is an example that uses the second alternative: it removes the style="display:none;" attribute from all the hidden cards, then continues with the WebElement's .text attribute to get the text content. The part you need from this example is the three lines below the comment # Loop through event list and unhide all event cards.

#Open the website
driver.get("https://rsoe-edis.org/eventList")

# Loop through event list and unhide all event cards
event_cards = driver.find_elements_by_class_name("event-card")
for card in event_cards:
    driver.execute_script("arguments[0].removeAttribute(\"style\")", card)

# Find all articles and add them to a file
articles = driver.find_elements_by_tag_name("tr")
with open("my_articles.csv", "wt") as f:
    for article in articles:
        header = article.find_element_by_class_name("title").text
        date = article.find_element_by_class_name("eventDate").text
        location = article.find_element_by_class_name("location").text
        link = article.find_element_by_tag_name("a").get_attribute("href")
        f.write(f"{header}, {date}, {location}, {link}\n")
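The third alternative from the list (fetching everything with raw JavaScript) could look roughly like the sketch below. It is untested against the live site: the class names are the ones used in the examples above, and fetch_rows and FETCH_ROWS_JS are names invented for this sketch. It relies on driver.execute_script() converting the returned JavaScript array of objects into a Python list of dicts.

```python
# JavaScript executed in the page. textContent is read directly from the DOM,
# so it works for hidden rows too. Class names are taken from the examples above.
FETCH_ROWS_JS = """
return Array.from(document.querySelectorAll('tr')).map(function (row) {
    var pick = function (selector) {
        var node = row.querySelector(selector);
        return node ? node.textContent.trim() : null;
    };
    var anchor = row.querySelector('a');
    return {
        header: pick('.title'),
        date: pick('.eventDate'),
        location: pick('.location'),
        link: anchor ? anchor.href : null
    };
});
"""


def fetch_rows(driver):
    """Return one dict per <tr>, skipping rows that have no title text."""
    rows = driver.execute_script(FETCH_ROWS_JS)
    return [row for row in rows if row.get("header")]
```

With a real driver you would call fetch_rows(driver) after driver.get(...). Because everything is collected in a single round trip to the browser, this is usually faster than calling find_element for every cell.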

PDHide · Accepted Answer · 2021-03-12 17:48:45Z

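from selenium import webdriver
import xlsxwriter
from datetime import datetime

now = (datetime.now()).strftime("%d-%m-%Y_%H-%M")

# Raw string so the backslashes in the Windows path are not treated as escapes
PATH = r"C:\Program Files (x86)\chromedriver.exe"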
driver = webdriver.Chrome(PATH)

workbook = xlsxwriter.Workbook("RSOE_" + now + ".xlsx")

worksheet = workbook.add_worksheet("EventList") 

#Open the website
driver.get("https://rsoe-edis.org/eventList")

#Take events list
articles = driver.find_elements_by_tag_name("tr")
row = 0
col = 0

for article in articles:
        
        header = article.find_element_by_class_name("title")
        date = article.find_element_by_class_name("eventDate")
        location = article.find_element_by_class_name("location")
        link = article.find_element_by_tag_name("a")  
        worksheet.write(row, col,     header.get_attribute("textContent"))
        worksheet.write(row, col + 1, date.get_attribute("textContent"))
        worksheet.write(row, col + 2, location.get_attribute("textContent"))
        worksheet.write(row, col + 3, link.get_attribute("href"))   

        print(header.get_attribute("textContent"))

        row += 1      
workbook.close()      

driver.close()

.text only returns the text of elements that are visible; use get_attribute("textContent") instead.
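If you want to keep using .text for visible elements and only fall back for hidden ones, a small helper along these lines might do. This is a sketch, not part of Selenium: element_text is a made-up name, and it assumes only that the object has a .text attribute and a .get_attribute() method, as a WebElement does. Since textContent keeps the raw whitespace from the markup, the helper also collapses runs of whitespace.

```python
def element_text(element):
    """Return an element's text, falling back to textContent when .text is empty."""
    text = element.text
    if not text:
        # Hidden elements give an empty .text; textContent ignores CSS visibility.
        text = element.get_attribute("textContent") or ""
    # Collapse newlines, tabs and repeated spaces into single spaces.
    return " ".join(text.split())
```

In the loop above this would be used as worksheet.write(row, col, element_text(header)).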