Building on this example, "How to extract html links with a matching word from a website using python", I wrote a web-scraping script that looks for keywords in current and cached versions of a local newspaper.

from bs4 import BeautifulSoup
import requests

urls = ["https://www.marinij.com/", 'https://web.archive.org/web/20210811185035/https://www.marinij.com/',
        'https://web.archive.org/web/20210506004633/https://www.marinij.com/','https://web.archive.org/web/20210211022431/https://www.marinij.com/',
        'https://web.archive.org/web/20201111174202/https://www.marinij.com/','https://web.archive.org/web/20200811204359/https://www.marinij.com/',
        'https://web.archive.org/web/20200511165943/https://www.marinij.com/','https://web.archive.org/web/20200209014056/https://www.marinij.com/',
        'https://web.archive.org/web/20191111061843/https://www.marinij.com/']

dates = ['today','aug2021','may2021','feb2021','nov2020','aug2020','may2020','feb2020','nov2019']

for i, (url,date) in enumerate(zip(urls,dates)):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('corona' or 'covid') in tag.get_text().lower())
    
    results = soup.find_all(covid_links)

    num_art = str((len(results)))
    if not results:
        results = ["The term COVID did not appear in the headlines this quarter!\n"]

    textfile = open("marin_covid_" + date + ".txt", "w")
    for idx, element in enumerate(results):
        element = str(element)
        # print(element)
        if idx == 0:
            textfile.write(date + "\n" + "Number of articles = " + num_art + "\n" + "\n" + element + "\n")

        else:
            textfile.write(element + "\n" + "\n")
    textfile.close()

files = ['marin_covid_today.txt', 'marin_covid_aug2021.txt', 'marin_covid_may2021.txt', 'marin_covid_feb2021.txt', 'marin_covid_nov2020.txt',
        'marin_covid_aug2020.txt', 'marin_covid_may2020.txt', 'marin_covid_feb2020.txt', 'marin_covid_nov2019.txt']

with open("COVID_articles_in_MIJ.txt", "w") as outfile:
    for filename in files:
        print(filename)
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)

It works really well with a single keyword, but when I use the "or" operator to match either of two keywords, it only searches for the first one. You can reproduce this by swapping the two keywords in the example, "covid" and "corona".

I know the problem lies in this lambda function, but I'm not sure how to fix it.

    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('corona' or 'covid') in tag.get_text().lower())

This code should be fully executable if you have the prerequisites (requests and beautifulsoup4) installed. All help is appreciated.

  • The expression ('corona' or 'covid') evaluates to just 'corona', so that's all that's being searched for. There simply isn't anything you can put on the left side of the in operator to search for multiple values; you'd have to write this as (('corona' in X) or ('covid' in X)). Commented Nov 12, 2021 at 3:42
  • You don't seem to understand order of operations in Python. ('corona' or 'covid') evaluates to 'corona', so it's then checking if 'corona' is in tag.get_text().lower(). So do tag.attrs and ('corona' in tag.get_text().lower() or 'covid' in tag.get_text().lower()) Commented Nov 12, 2021 at 3:43
  • That's actually very helpful, though you could have been less rude about it ¯_(ツ)_/¯ Commented Nov 12, 2021 at 3:52
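
The behaviour described in the comments can be checked in isolation. This is only a minimal sketch; the headline string is made up for illustration and is not taken from the newspaper.

```python
# `or` returns its first truthy operand, so the parenthesised
# expression collapses to a single string before `in` is applied.
keyword = ('corona' or 'covid')

headline = "county reports new covid cases"  # hypothetical headline text

# Buggy form: only 'corona' is ever tested against the headline.
buggy = ('corona' or 'covid') in headline

# Explicit form: each keyword gets its own `in` test.
fixed = ('corona' in headline) or ('covid' in headline)

print(keyword, buggy, fixed)
```

Swapping the two strings inside the parentheses changes which one survives, which matches what the question reports.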

1 Answer

As pointed out in the comments, the issue is that the 'in' operator must appear on both sides of the 'or' operator, so that the text being searched, in this case tag.get_text().lower(), is tested against both keywords, "corona" and "covid". The corrected lambda function is:

covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('covid' in tag.get_text().lower() or 'corona' in tag.get_text().lower()))
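
If the keyword list is likely to grow, one possible variation is to keep the keywords in a tuple and test them with any(). This is a sketch rather than part of the original fix: the names KEYWORDS and mentions_keyword are hypothetical, and covid_links is rewritten here as a named function that can be passed to soup.find_all() in the same way as the lambda.

```python
KEYWORDS = ("covid", "corona")  # assumed keyword list; extend as needed


def mentions_keyword(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in text, ignoring case."""
    lowered = text.lower()  # lower-case once instead of once per keyword
    return any(word in lowered for word in keywords)


def covid_links(tag):
    """Filter for soup.find_all(): <a> tags with an href whose text matches."""
    return (getattr(tag, "name", None) == "a"
            and "href" in tag.attrs
            and mentions_keyword(tag.get_text()))
```

With this in place, results = soup.find_all(covid_links) works as before, and adding a third keyword only means editing the tuple.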