
I'm trying to scrape data from the website https://rsoe-edis.org/eventList and save it to an .xlsx file. The scraper doesn't raise any error, but it skips some content: it saves all the links, but for some rows the other information is missing. Why?

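import xlsxwriter
from datetime import datetime
from selenium import webdriver

now = datetime.now().strftime("%d-%m-%Y_%H-%M")

# Raw string so the backslashes in the Windows path are kept literally
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)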

workbook = xlsxwriter.Workbook("RSOE_" + now + ".xlsx")

worksheet = workbook.add_worksheet("EventList") 

#Open the website
driver.get("https://rsoe-edis.org/eventList")

#Take events list
articles = driver.find_elements_by_tag_name("tr")
row = 0
col = 0

for article in articles:
    header = article.find_element_by_class_name("title")
    date = article.find_element_by_class_name("eventDate")
    location = article.find_element_by_class_name("location")
    link = article.find_element_by_tag_name("a")
    worksheet.write(row, col, header.text)
    worksheet.write(row, col + 1, date.text)
    worksheet.write(row, col + 2, location.text)
    worksheet.write(row, col + 3, link.get_attribute("href"))

    print(header.text)

    row += 1

workbook.close()

driver.close()

2 Answers


Problem explanation

The problem is that many of the event cards are hidden (they have the style attribute display:none;), and Selenium's .text property only returns text that is visible to the user, so it comes back as an empty string for hidden elements.
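
A quick way to confirm this diagnosis is to ask Selenium which rows it considers displayed. The sketch below is only an illustration: split_by_visibility is a hypothetical helper name, and it relies on nothing but the WebElement method is_displayed(), so any list returned by find_elements_by_tag_name can be passed in.

```python
def split_by_visibility(elements):
    """Partition WebElement-like objects into (visible, hidden) lists."""
    visible, hidden = [], []
    for element in elements:
        # is_displayed() is False when the element or an ancestor has display:none
        (visible if element.is_displayed() else hidden).append(element)
    return visible, hidden
```

With the question's code this could be called as visible, hidden = split_by_visibility(articles). If the explanation above holds, the rows in hidden are the ones whose .text comes back empty.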

Solution

To get at the content of hidden elements you could, among other options:

  • read the WebElement's attribute or property values instead (e.g. .get_attribute("innerText")),
  • use raw JavaScript to unhide the elements and then continue with .text, or
  • use raw JavaScript to fetch the content of all the elements directly.

Example getting the element text content using .get_attribute()

Here I use the WebElement's .get_attribute() method to read the content via the innerText property, then the string .strip() method to remove leading and trailing whitespace.

driver.get("https://rsoe-edis.org/eventList")
articles = driver.find_elements_by_tag_name("tr")
with open("my_articles.csv", "wt") as f:
    for article in articles:
        header = article.find_element_by_class_name("title").get_attribute("innerText").strip()
        date = article.find_element_by_class_name("eventDate").get_attribute("innerText").strip()
        location = article.find_element_by_class_name("location").get_attribute("innerText").strip()
        link = article.find_element_by_tag_name("a").get_attribute("href")
        f.write(f"{header}, {date}, {location}, {link}\n")
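
One caveat with writing the line by hand as above: if a title or location itself contains a comma, the resulting CSV row ends up with extra columns. A minimal sketch of the same write step using the standard library's csv module, which quotes such fields, follows. The name write_rows and the sample values are hypothetical and not taken from the site.

```python
import csv
import io

def write_rows(rows, fileobj):
    """Write (header, date, location, link) tuples as properly quoted CSV."""
    writer = csv.writer(fileobj)
    writer.writerows(rows)

# Demo with an in-memory file and made-up values
buffer = io.StringIO()
write_rows([("Flood", "2021-01-01", "City, Country", "https://example.com/1")], buffer)
print(buffer.getvalue())
```

When writing to a real file instead of StringIO, the csv documentation recommends opening it with newline="" so that line endings are not translated twice.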

Example unhiding elements with raw JavaScript enabling .text

Below is an example of the second alternative: I remove the style="display:none;" attribute from all the hidden cards, then continue with the WebElement's .text property to get the text content. The part you need from this example is the three lines below the comment # Loop through event list and unhide all event cards.

#Open the website
driver.get("https://rsoe-edis.org/eventList")

# Loop through event list and unhide all event cards
event_cards = driver.find_elements_by_class_name("event-card")
for card in event_cards:
    driver.execute_script("arguments[0].removeAttribute(\"style\")", card)

# Find all articles and add them to a file
articles = driver.find_elements_by_tag_name("tr")
with open("my_articles.csv", "wt") as f:
    for article in articles:
        header = article.find_element_by_class_name("title").text
        date = article.find_element_by_class_name("eventDate").text
        location = article.find_element_by_class_name("location").text
        link = article.find_element_by_tag_name("a").get_attribute("href")
        f.write(f"{header}, {date}, {location}, {link}\n")
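
The third alternative, fetching everything with raw JavaScript, is not shown above. A rough sketch follows, assuming the same class names (title, eventDate, location) used in the examples above; EXTRACT_ROWS_JS and fetch_rows are hypothetical names. execute_script returns a JavaScript array of objects as a Python list of dicts, and textContent is read inside the browser, so hidden cards are included.

```python
# JavaScript run in the page: one object per <tr>, read via textContent
EXTRACT_ROWS_JS = """
return Array.from(document.querySelectorAll("tr")).map(tr => {
    const text = cls => {
        const el = tr.querySelector("." + cls);
        return el ? el.textContent.trim() : "";
    };
    const a = tr.querySelector("a");
    return {
        header: text("title"),
        date: text("eventDate"),
        location: text("location"),
        link: a ? a.href : ""
    };
});
"""

def fetch_rows(driver):
    """Run the extraction script in the page and drop rows without a link."""
    rows = driver.execute_script(EXTRACT_ROWS_JS)
    return [row for row in rows if row["link"]]
```

This makes a single round trip to the browser instead of several find_element calls per row. Dropping rows without a link is a choice made for this sketch, on the assumption that such rows are table headers rather than events.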

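import xlsxwriter
from datetime import datetime
from selenium import webdriver

now = datetime.now().strftime("%d-%m-%Y_%H-%M")

# Raw string so the backslashes in the Windows path are kept literally
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)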

workbook = xlsxwriter.Workbook("RSOE_" + now + ".xlsx")

worksheet = workbook.add_worksheet("EventList") 

#Open the website
driver.get("https://rsoe-edis.org/eventList")

#Take events list
articles = driver.find_elements_by_tag_name("tr")
row = 0
col = 0

for article in articles:
    header = article.find_element_by_class_name("title")
    date = article.find_element_by_class_name("eventDate")
    location = article.find_element_by_class_name("location")
    link = article.find_element_by_tag_name("a")
    worksheet.write(row, col, header.get_attribute("textContent"))
    worksheet.write(row, col + 1, date.get_attribute("textContent"))
    worksheet.write(row, col + 2, location.get_attribute("textContent"))
    worksheet.write(row, col + 3, link.get_attribute("href"))

    print(header.get_attribute("textContent"))

    row += 1

workbook.close()

driver.close()

.text only returns the text of elements that are visible; use get_attribute("textContent") instead.
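
To show the difference in one place, here is a small sketch of a helper that prefers the visible text and falls back to textContent when .text is empty. The name element_text is hypothetical; it uses only the .text property and the get_attribute() method that the code above already relies on.

```python
def element_text(element):
    """Return the visible text if any, else the DOM textContent, stripped."""
    visible = element.text
    if visible:
        return visible.strip()
    # Hidden elements yield "" from .text; textContent ignores visibility
    raw = element.get_attribute("textContent")
    return (raw or "").strip()
```

In the loop above it could be used as worksheet.write(row, col, element_text(header)), and likewise for date and location.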
