I am scraping a dynamic page that reveals more results through a 'Load More' button, so I am using Selenium to click it. I have four problems. First, the 'Load More' button is clicked only once. Second, only the articles that appear before the first 'Load More' click are scraped; nothing loaded after it is. Third, every article is scraped twice. Fourth, I only want the date, but the output also includes the author and the place along with it.
import time
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

base = "https://indianexpress.com"
browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
wait = WebDriverWait(browser, 10)
browser.get('https://indianexpress.com/?s=cybersecurity')

while True:
    try:
        time.sleep(6)
        show_more = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'Load More')))
        show_more.click()
    except Exception as e:
        print(e)
        break

soup = BeautifulSoup(browser.page_source, 'lxml')
search_results = soup.find('div', {'id': 'ie-infinite-scroll'})
links = search_results.find_all('a')

for link in links:
    link_url = link['href']
    response = requests.get(link_url)
    sauce = BeautifulSoup(response.text, 'html.parser')
    dateTag = sauce.find('div', {'class': 'm-story-meta__credit'})
    titleTag = sauce.find('h1', {'class': 'm-story-header__title'})
    contentTag = ' '.join([item.get_text(strip=True) for item in sauce.select("[class^='o-story-content__main a-wysiwyg'] p")])
    date = None
    title = None
    content = None
    if isinstance(dateTag, Tag):
        date = dateTag.get_text().strip()
    if isinstance(titleTag, Tag):
        title = titleTag.get_text().strip()
    print(f'{date}\n {title}\n {contentTag}\n')
    time.sleep(3)
The code runs without raising any errors, but it needs refinement. What should I do to solve the problems described above?
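To make the first two problems easier to discuss, here is a minimal sketch of the loop behaviour I am aiming for, written without Selenium so it can be run anywhere. The names `click_until_exhausted`, `find_button`, `count_items`, `settle` and `FakePage` are all hypothetical and are not part of my script or of any library; the fake page only stands in for the browser.

```python
def click_until_exhausted(find_button, count_items, settle, max_clicks=50):
    """Click a 'load more' control until it is gone or adds nothing new.

    find_button: callable returning an object with .click(), or None if absent.
    count_items: callable returning how many result items are on the page.
    settle:      callable that pauses until newly loaded items have appeared.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break  # no button left, so everything has been loaded
        before = count_items()
        button.click()
        settle()  # give the page time to append the new items
        clicks += 1
        if count_items() <= before:
            break  # the click added nothing, so stop instead of looping forever
    return clicks


# Hypothetical stand-in for a page, so the loop can run without a browser.
class FakePage:
    def __init__(self, batches):
        self.items = 10          # items visible before any click
        self.batches = batches   # how many extra batches can still be loaded

    def click(self):
        self.batches -= 1
        self.items += 10

    def find_button(self):
        return self if self.batches > 0 else None


page = FakePage(batches=3)
total_clicks = click_until_exhausted(page.find_button, lambda: page.items, lambda: None)
print(total_clicks, page.items)
```

In the real script I assume `find_button` could return the first element of `browser.find_elements(By.LINK_TEXT, 'Load More')` (or `None` when that list is empty), `count_items` could count the links inside the `ie-infinite-scroll` container, and `settle` could be the existing `time.sleep(6)`. The page source would then be parsed only once, after the loop has finished.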
Thanks.
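For the third and fourth problems, this is a minimal sketch of the post-processing I have in mind, using only the standard library. It assumes the duplicates come from the same URL appearing more than once in `links`, and that the credit line contains a date written as "Month D, YYYY"; both are assumptions on my part. The helper names and the sample data are hypothetical and made up for illustration.

```python
import re

# Matches dates such as "March 5, 2020". The "Month D, YYYY" layout is an
# assumption about how the credit line is formatted.
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)


def unique_links(hrefs):
    """Return the hrefs with duplicates removed, keeping first-seen order."""
    return list(dict.fromkeys(hrefs))


def extract_date(meta_text):
    """Return the first 'Month D, YYYY' date in meta_text, or None."""
    match = DATE_PATTERN.search(meta_text)
    return match.group(0) if match else None


# Hypothetical sample data, not real output from the site.
sample_hrefs = [
    "https://example.com/article-a/",
    "https://example.com/article-a/",
    "https://example.com/article-b/",
]
sample_meta = "By: Some Author | New Delhi | Updated: March 5, 2020 10:12:45 am"

print(unique_links(sample_hrefs))
print(extract_date(sample_meta))
```

In my script this would mean looping over `unique_links(link['href'] for link in links)` instead of `links`, and calling `extract_date(dateTag.get_text())` instead of printing the whole credit text.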