Using this example - How to extract html links with a matching word from a website using python
I wrote a web scraping script to look for keywords in recent and cached versions of a local newspaper.
from bs4 import BeautifulSoup
import requests
urls = ["https://www.marinij.com/",
        'https://web.archive.org/web/20210811185035/https://www.marinij.com/',
        'https://web.archive.org/web/20210506004633/https://www.marinij.com/',
        'https://web.archive.org/web/20210211022431/https://www.marinij.com/',
        'https://web.archive.org/web/20201111174202/https://www.marinij.com/',
        'https://web.archive.org/web/20200811204359/https://www.marinij.com/',
        'https://web.archive.org/web/20200511165943/https://www.marinij.com/',
        'https://web.archive.org/web/20200209014056/https://www.marinij.com/',
        'https://web.archive.org/web/20191111061843/https://www.marinij.com/']
dates = ['today','aug2021','may2021','feb2021','nov2020','aug2020','may2020','feb2020','nov2019']
for i, (url,date) in enumerate(zip(urls,dates)):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                               'href' in tag.attrs and
                               ('corona' or 'covid') in tag.get_text().lower())
    results = soup.find_all(covid_links)
    num_art = str((len(results)))
    if not results:
        results = ["The term COVID did not appear in the headlines this quarter!\n"]
    textfile = open("marin_covid_" + date + ".txt", "w")
    for idx, element in enumerate(results):
        element = str(element)
        # print(element)
        if idx == 0:
            textfile.write(date + "\n" + "Number of articles = " + num_art + "\n" + "\n" + element + "\n")
        else:
            textfile.write(element + "\n" + "\n")
    textfile.close()
files = ['marin_covid_today.txt', 'marin_covid_aug2021.txt', 'marin_covid_may2021.txt', 'marin_covid_feb2021.txt', 'marin_covid_nov2020.txt',
         'marin_covid_aug2020.txt', 'marin_covid_may2020.txt', 'marin_covid_feb2020.txt']
with open("COVID_articles_in_MIJ.txt", "w") as outfile:
    for filename in files:
        print(filename)
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)
It works really well with a single keyword, but when I use the "or" operator to look for more than one keyword, it only searches for the first word. This can be replicated by switching the two keywords in the example, "covid" and "corona".
I know the problem lies in this lambda function, but I'm not sure how to address it.
covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                           'href' in tag.attrs and
                           ('corona' or 'covid') in tag.get_text().lower())
This code should be fully executable if you have the prerequisites installed. All help is appreciated.
('corona' or 'covid') evaluates to just 'corona', so that's all that's being searched for. There simply isn't anything you can put on the left side of the in operator to search for multiple values; you'd have to write this as (('corona' in X) or ('covid' in X)).
('corona' or 'covid') evaluates to 'corona', so it's then checking if 'corona' is in tag.get_text().lower(). So do:
'href' in tag.attrs and ('corona' in tag.get_text().lower() or 'covid' in tag.get_text().lower())
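If the keyword list is likely to grow, one possible way to avoid repeating the in test is any() over a tuple of keywords. The following is only a sketch: the names KEYWORDS, mentions_any and is_keyword_link are made up for illustration and do not come from the question, and the predicate assumes the tag objects behave like BeautifulSoup tags (a name, an attrs dict and a get_text() method).

```python
# Hypothetical keyword list; extend the tuple as needed.
KEYWORDS = ("corona", "covid")


def mentions_any(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in text, ignoring case."""
    lowered = text.lower()
    return any(word in lowered for word in keywords)


def is_keyword_link(tag):
    """Predicate for find_all: an <a> tag with an href whose text mentions a keyword."""
    return (getattr(tag, "name", None) == "a"
            and "href" in tag.attrs
            and mentions_any(tag.get_text()))


# `or` returns its first truthy operand, which is why the original
# expression only ever tested for 'corona'.
print("corona" or "covid")  # corona

print(mentions_any("COVID cases rise in Marin"))
print(mentions_any("County fair returns"))
```

Under those assumptions, is_keyword_link could be passed to soup.find_all in place of the covid_links lambda.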