Use OR in Lambda function - Web Scraping Python

Question

Using this example - How to extract html links with a matching word from a website using python

I wrote a web scraping script to look for keywords in recent and cached versions of a local newspaper.

from bs4 import BeautifulSoup
import requests

urls = ["https://www.marinij.com/", 'https://web.archive.org/web/20210811185035/https://www.marinij.com/',
        'https://web.archive.org/web/20210506004633/https://www.marinij.com/','https://web.archive.org/web/20210211022431/https://www.marinij.com/',
        'https://web.archive.org/web/20201111174202/https://www.marinij.com/','https://web.archive.org/web/20200811204359/https://www.marinij.com/',
        'https://web.archive.org/web/20200511165943/https://www.marinij.com/','https://web.archive.org/web/20200209014056/https://www.marinij.com/',
        'https://web.archive.org/web/20191111061843/https://www.marinij.com/']

dates = ['today','aug2021','may2021','feb2021','nov2020','aug2020','may2020','feb2020','nov2019']

for i, (url,date) in enumerate(zip(urls,dates)):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)

    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('corona' or 'covid') in tag.get_text().lower())
    
    results = soup.find_all(covid_links)

    num_art = str((len(results)))
    if not results:
        results = ["The term COVID did not appear in the headlines this quarter!\n"]

    textfile = open("marin_covid_" + date + ".txt", "w")
    for idx, element in enumerate(results):
        element = str(element)
        # print(element)
        if idx == 0:
            textfile.write(date + "\n" + "Number of articles = " + num_art + "\n" + "\n" + element + "\n")

        else:
            textfile.write(element + "\n" + "\n")
    textfile.close()

files = ['marin_covid_today.txt', 'marin_covid_aug2021.txt', 'marin_covid_may2021.txt', 'marin_covid_feb2021.txt', 'marin_covid_nov2020.txt',
        'marin_covid_aug2020.txt', 'marin_covid_may2020.txt', 'marin_covid_feb2020.txt']

with open("COVID_articles_in_MIJ.txt", "w") as outfile:
    for filename in files:
        print(filename)
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)

It works really well with a single keyword, but when I use the "or" operator to look for either of two keywords, it only searches for the first one. This can be reproduced by swapping the two keywords in the example, "covid" and "corona".

I know the problem lies in this lambda function, but I'm not sure how to address it.

    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('corona' or 'covid') in tag.get_text().lower())

This code should be fully executable if you have the prerequisites installed. All help is appreciated.

The expression ('corona' or 'covid') evaluates to just 'corona', so that's all that's being searched for. There simply isn't anything you can put on the left side of the in operator to search for multiple values; you'd have to write this as (('corona' in X) or ('covid' in X)).
– jasonharper, Commented Nov 12, 2021 at 3:42
You don't seem to understand order of operations in Python. ('corona' or 'covid') evaluates to 'corona', so it's then checking if 'corona' is in tag.get_text().lower(). So do tag.attrs and ('corona' in tag.get_text().lower() or 'covid' in tag.get_text().lower())
– Tommy A., Commented Nov 12, 2021 at 3:43
That's actually very helpful, though you could have been less rude about it ¯_(ツ)_/¯
– John Conor, Commented Nov 12, 2021 at 3:52
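
The behaviour described in these comments can be checked in isolation. The snippet below is only a minimal sketch: the headline string is made up for illustration and does not come from the scraped pages.

```python
# `or` returns its first truthy operand; a non-empty string is truthy,
# so the parenthesised expression collapses to 'corona' before `in` runs.
keyword = ('corona' or 'covid')   # 'corona'

headline = "covid cases rise in marin"   # hypothetical sample text

# Broken form: only ever tests for 'corona'.
broken = ('corona' or 'covid') in headline

# Working form: one `in` test per keyword, joined with `or`.
fixed = ('corona' in headline) or ('covid' in headline)

print(keyword, broken, fixed)
```

Swapping the two words inside the parentheses changes which keyword is searched for, which matches what the question observes.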

John Conor · Accepted Answer · 2021-11-12 04:21:49Z


As pointed out in the comments, the 'in' operator must appear on both sides of the 'or' operator, so that the text being searched, in this case tag.get_text().lower(), is tested against each keyword, "corona" and "covid". The correct lambda function is this:

covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('covid' in tag.get_text().lower() or 'corona' in tag.get_text().lower()))
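
If the keyword list grows, one possible variation is to replace the chained 'or' with the built-in any(). This is a sketch rather than part of the accepted fix: the KEYWORDS name is hypothetical, and the filter is written as a named function instead of a lambda so the tag text is lowercased only once.

```python
# Hypothetical keyword list; add more terms as needed.
KEYWORDS = ("covid", "corona")

def covid_links(tag):
    """Return True for <a> tags with an href whose text mentions any keyword."""
    if getattr(tag, "name", None) != "a" or "href" not in tag.attrs:
        return False
    text = tag.get_text().lower()  # compute the lowercased text once
    # any() stops at the first keyword found in the text.
    return any(keyword in text for keyword in KEYWORDS)

# Usage inside the loop from the question (soup is the BeautifulSoup object):
# results = soup.find_all(covid_links)
```

Because find_all accepts any callable that takes a tag and returns a boolean, this function can be passed in exactly where the lambda was.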
