
Why I made it

I’m building a voice assistant and thought it would be great if it could deliver news updates.

What it does

It cleans up the user’s query, determines whether it refers to a news category or a specific search term, and then calls NewsAPI to fetch the most recent matching headline. From the response it extracts and tidies the article’s title, author, source, and description, then returns a conversational summary that the text-to-speech module of my voice assistant (which I haven't posted yet) can read aloud with ease.

Code

Here is the Python code:


import requests
import json5


def NewsConcerning(text):
    """ Search for news concerning the text """
    text = (
        str(text)
        .replace("news", "")
        .replace("new", "")
        .replace("for ", "")
        .replace("concerning", "")
        .replace("can ", "")
        .replace("you ", "")
        .replace("give ", "")
        .replace("me ", "")
        .replace("the ", "")
        .replace("most ", "")
        .replace("recent ", "")
        .replace("in ", "")
        .replace("category ", "")
        .replace(" currently ", "")
        .replace("current ", "")
        .replace("now ", "")
        .replace("what ", "")
        .replace("whats ", "")
        .replace(" is ", "")
        .replace("about ", "")
        .replace(" ", "")
        .replace("today", "")
    )
    if "business" in text:
        text = "category=business"
    elif "entertainment" in text:
        text = "category=entertainment"
    elif "general" in text:
        text = "category=general"
    elif "health" in text:
        text = "category=health"
    elif "science" in text:
        text = "category=science"
    elif "sports" in text:
        text = "category=sports"
    elif "technology" in text:
        text = "category=technology"
    elif text == "":  # Eg input has been fully stripped and nothing is left
        text = "category=general"
    else:  # If specific text is queried:
        text = "q=" + text
    url = (
        "https://newsapi.org/v2/top-headlines?"
        + text
        + "&sortBy=publishedAt&pageSize=1&apiKey=fc8307a1197745a68e9a4479056fedd8"
    )
    response = requests.get(url)
    parsed_data = json5.loads(response.text)
    #print(parsed_data)
    try:
        #print(text)
        if "articles" in parsed_data and parsed_data["articles"]:
            print("<ALL OK>")
        first_article = parsed_data["articles"][0]
        name = first_article["source"]["name"]
        author = first_article["author"]
        title = first_article["title"]
        desc = first_article["description"]
        title = (
            title.replace(name, "")
            .replace(" - ", " ")
            .replace("LISTEN | ", "")
            .replace(
                " | Opinion",
                ". I have also come to notice how this is an opinion-based text.",
            )
            .replace(" | ", " ")
            .replace("Watch Live: ", "live, ")
            .replace("!!!!!!", ", followed by six exclamation marks")
            .replace("!!!!!", ", followed by five exclamation marks")
            .replace("!!!!", ", followed by four exclamation marks")
            .replace("!!!", ", followed by three exclamation marks")
            .replace("!!", ", followed by two exclamation marks")
            .replace("\n", "")
        )
        if author:
            if author.lower() != name.lower():
                author = author.replace(",", "and")
                author = " by " + author
                title = title.replace(author, "")
            else:
                author = ""
        else:
            author = ", though I did not find who wrote it"
        if desc == "":
            desc = "Sadly, I could not find any description."
        else:
            content = desc.split("<p>")  # In case description comes in HTML format
            if len(content) > 1:
                desc = content[1]
            desc = desc.replace("&quot;", "").replace(".com", "")
        if len(name):
            name = "on " + name
        return f"\n\nHere is the recently published news I found {name}{author}:\nThe title to it is {title}... \n{desc}"
    except IndexError:
        text = text.replace("q=", "").replace("category=", "")
        return f"I could not find anything concerning what you asked for. Maybe I misheard, were you asking about {text}? Try again if not."
    except KeyError:
        return "Whoops! Looks like you have been very interested in the news today! I'm sorry to say that my daily request limit, one hundred per day or fifty per twelve hours, has been reached."


i = input("Enter what you would ask of the voice assistant (eg: news concerning bitcoin): ")
print(NewsConcerning(i))

Critique request

Please, tell me anything that comes to mind.


5 Answers


Request

Instead of:

url = (
    "https://newsapi.org/v2/top-headlines?"
    + text
    + "&sortBy=publishedAt&pageSize=1&apiKey=fc8307a1197745a68e9a4479056fedd8"
)
response = requests.get(url)

I find this more legible:

url = "https://newsapi.org/v2/top-headlines"
params = {
    "category": "general",
    "sortBy": "publishedAt",
    "pageSize": 1,
    "apiKey": "fc8307a1197745a68e9a4479056fedd8"
}
response = requests.get(url, params=params)

NB:

  • You can use a requests.Session() instead of calling requests.get directly; a session reuses the underlying connection across requests.
  • Always check the status code. It should normally be 200; otherwise the rest of the program is going to fail, or at the very least yield unexpected results.
  • An API key should not be hardcoded. Use a .env file instead.

HTML parsing

This approach does not scale well:

desc = desc.replace("&quot;", "").replace(".com", "")

Who knows how many more HTML entities and tags are going to show up over time. Use a library like BeautifulSoup instead; lxml would also be sufficient.
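As a quick stdlib illustration of what such cleanup involves (a minimal sketch; the helper name is my own, and a real parser like BeautifulSoup handles malformed markup far more robustly):

```python
import html
import re

def clean_description(desc: str) -> str:
    """Strip tags and decode entities from an article description.

    A minimal stdlib sketch; a real HTML parser (BeautifulSoup, lxml)
    should be preferred for anything beyond trivial input.
    """
    text = re.sub(r"<[^>]+>", "", desc)   # drop tags such as <p>...</p>
    return html.unescape(text).strip()    # decode &quot;, &amp;, ...

print(clean_description("<p>Bitcoin hits a &quot;record&quot; high.</p>"))
```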

If you are always expecting a JSON response, it would be appropriate to build a small function that retrieves content from a URL, based on one or more parameters you supply (notably the category). Then you can have another dedicated function for parsing the JSON response. The idea is: separation of responsibility. One function should do one thing and do it well.

Also, isolating concerns makes debugging easier. For example, if the JSON response is invalid, you can jump straight to the function that is responsible for retrieving the content.
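A sketch of that split, using the same endpoint as the question (the function names are my own, for illustration):

```python
import requests

API_URL = "https://newsapi.org/v2/top-headlines"

def fetch_top_headlines(params: dict) -> dict:
    """Retrieve and decode the JSON payload (one job: HTTP)."""
    response = requests.get(API_URL, params=params, timeout=10)
    response.raise_for_status()   # fail loudly at the site of the error
    return response.json()

def extract_first_article(payload: dict):
    """Pull the first article out of a payload, or None (one job: parsing)."""
    articles = payload.get("articles") or []
    return articles[0] if articles else None
```

With this split, a bad JSON response points you straight at fetch_top_headlines, while formatting bugs live entirely in the parsing side.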

Securing credentials

The API key will most likely show up in web server logs as it is part of the URL. Whenever possible, transmit it differently, using headers for instance.

The documentation states that this method is supported:

You can attach your API key to a request in one of three ways:

Via the apiKey querystring parameter.
Via the X-Api-Key HTTP header.
Via the Authorization HTTP header. Including Bearer is optional, and be sure not to base 64 encode it like you may have seen in other authentication tutorials.

We strongly recommend either of the last two so that your API key isn't visible to others in logs or via request sniffing.

(emphasis is mine)
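To see the difference, you can build a request without sending it ("MY_KEY" below is a placeholder, not a real key):

```python
import requests

# Prepare (without sending) a request that carries the key in a header.
prepared = requests.Request(
    "GET",
    "https://newsapi.org/v2/top-headlines",
    params={"category": "general", "pageSize": 1},
    headers={"X-Api-Key": "MY_KEY"},
).prepare()

print(prepared.url)                    # the key does not appear in the URL
print(prepared.headers["X-Api-Key"])   # it travels in the header instead
```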


Naming

NewsConcerning is styled the way PEP 8 would tell us to style a class name. As a function, it should be in snake case: news_concerning.

String replacements

Many of the substrings you replace with empty strings are problematic. Replacing "in " with "", for instance, would change "vein " to "ve".

To address this you can utilize word boundaries (\b) in a regular expression.

DRY

You call str.replace several times. Instead, create a list of strings, join them with | and turn them into a regular expression that replaces any of them with "".

Taking these into consideration:

import re

def NewsConcerning(text):
    """ Search for news concerning the text """
    substrings = [
        "news", "new", "for ", "concerning", "can ",
        "you ", "give ", "me ", "the ", "most ", "recent ",
        "in ", "category ", " currently ", "current ",
        "now ", "what ", "whats ", " is ", "about ",
        " ", "today"
    ]

    pat = re.compile("|".join(r"\b" + re.escape(s) + r"\b" for s in substrings))

    text = pat.sub("", str(text))

This also removes the possibility for one call to str.replace to change the string so that a subsequent call will be triggered that otherwise wouldn't.

More repetition

Your conditional has more repetition.

    categories = (
        "business", "entertainment", "general", "health",
        "science", "sports", "technology"
    )
  
    cat_in_text = next((cat for cat in categories if cat in text), None)

    if cat_in_text:
        text = f"category={cat_in_text}"
    elif text == "":  # Eg input has been fully stripped and nothing is left
        text = "category=general"
    else:  # If specific text is queried:
        text = "q=" + text

URLs

But there's a problem with this approach. You should next be escaping this text variable when you insert it into your URL, but you can't without escaping that = you inserted intentionally.
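One way to sidestep the escaping problem entirely (my own sketch, not part of the answer above) is to return a dict of parameters and let the URL layer do the quoting:

```python
from urllib.parse import urlencode

CATEGORIES = ("business", "entertainment", "general", "health",
              "science", "sports", "technology")

def build_query(text: str) -> dict:
    """Map cleaned-up input to NewsAPI query parameters."""
    cat = next((c for c in CATEGORIES if c in text), None)
    if cat:
        return {"category": cat}
    if not text:                      # input was fully stripped
        return {"category": "general"}
    return {"q": text}

print(urlencode(build_query("c++")))  # q=c%2B%2B -- quoting handled for us
```

Passing such a dict as `params=` to requests.get gives the same benefit.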

__main__ guard

The final bit of code should be wrapped in a __main__ guard.

if __name__ == "__main__":
    i = input("Enter what you would ask of the voice assistant (eg: news concerning bitcoin): ")
    print(NewsConcerning(i))

Case sensitivity

As it stands, your program doesn't work very well if the user inputs mixed-case words. The fix is easy: convert the input to lowercase.

You can either convert the input immediately to lowercase, or let the NewsConcerning function do it.

if __name__ == "__main__":
    i = input("Enter what you would ask of the voice assistant (eg:news concerning bitcoin): ")
    print(NewsConcerning(i.lower()))

Exception handling

Your try is miles away from your actual exception handlers. The code in a try block should be as minimal as possible.
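For example (a sketch with made-up data), the try block can be reduced to the single subscript that can actually raise:

```python
parsed_data = {"status": "ok", "articles": []}  # e.g. an empty result set

try:
    article = parsed_data["articles"][0]   # the only line that can raise here
except (KeyError, IndexError):
    article = None

# All formatting of title/author/description then happens outside the
# try block, on data we already know is present.
print(article)
```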

Boolean tests and false-y values

In a couple of places you do boolean tests on the length of lists or on whether strings are empty. You should be aware that empty strings are false in Python, as are empty lists.


String Replacements and Matching

You should use a case-insensitive regular expression using the re.I flag to do your matching. For example:

import re
...

matching_words = (
    'business',
    'entertainment',
    ...
    'general'
)

matching_rex = re.compile(fr'\b({"|".join(matching_words)})\b', flags=re.I)
m = matching_rex.search(text)
if m:
    ...

If any of the words in matching_words contain characters that have special meaning in regular expressions (e.g. '.', '|', '(', etc.), then each word should be escaped, either manually in matching_words or, preferably, by calling re.escape on each word to be matched.

A case-insensitive regular expression can likewise be used with re.sub to do your replacements on the input text. For the sake of efficiency, only run the replacement logic when you are unable to find a keyword match.
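The same idea applied to the replacement side might look like this (the stop-word list here is illustrative, not the question's full list):

```python
import re

stop_words = ("news", "concerning", "about", "what", "is", "the")
stop_rex = re.compile(fr'\b({"|".join(map(re.escape, stop_words))})\b', flags=re.I)

cleaned = stop_rex.sub("", "What is the news concerning Bitcoin")
cleaned = re.sub(r"\s+", " ", cleaned).strip()   # collapse leftover spaces
print(cleaned)
```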

Docstrings and Comments

These would be helpful.


single responsibility

The SRP says that a function should do one thing well.

The expression that replaces "concerning" and other stop words should be in a helper function. For one thing, that would make it very easy to unit test.

consistency

        .replace(" currently ", "")
        .replace("current ", "")
        .replace("now ", "")
        .replace("what ", "")
        .replace("whats ", "")
        .replace(" is ", "")

It's a bit worrying that "new" has no SPACE characters, "for " ends with one, and a few words are surrounded by them. I would have a hard time writing a docstring that explains just what cleaned up result we're computing here. That is to say, there doesn't appear to be a proper specification for the cleanup behavior.

Also, if the input text is on the long side, repeatedly scanning it for two dozen words scales poorly. Given that we already have a bunch of stop words, I would anticipate that in a few months we might accumulate, say, twice as many. That's just how code maintenance tends to go.

Consider putting a stake in the ground: "We parse text as English words, and remove stopwords." And then you might start to implement that spec with import nltk. That package offers nltk.word_tokenize(), which is robust and convenient.

Or you might go a bit further, such as saying you will strip punctuation, as you would apparently want a "what's" \$\rightarrow\$ "whats" conversion. You might even lemmatize, so a common word stem would match e.g. both "current" and "currently".

Consider putting your stop words in a set(), so that when you loop over for word in words: you can probe set membership in \$O(1)\$ constant time, rather than consuming time proportional to number of stopwords.
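A sketch of that combination, tokenizing once and then probing a set (the stop-word set here is illustrative, not nltk's full English list):

```python
# Membership tests against a set are O(1), regardless of how many
# stop words accumulate over time.
stop_words = {"what", "is", "the", "news", "concerning", "about", "today"}

def remove_stop_words(text: str) -> str:
    """Keep only the words that are not stop words."""
    words = text.lower().split()   # crude tokenization; nltk.word_tokenize is more robust
    return " ".join(w for w in words if w not in stop_words)

print(remove_stop_words("What is the news about Bitcoin today"))
```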

break out a helper

    if "business" in text:
        text = "category=business"
    ...
    elif "technology" in text:
        text = "category=technology"

This should be a for loop over the enumerated categories, with an assignment of text = f"category={category}", rather than the copy-pasta we see here. Any time you're tempted to start leaning on that CMD-V paste key, take a breath, step back, and think about what the appropriate function signature should be, so you can parameterize instead of pasting and then editing.

Also, don't put a secret API key in source code, checked into git, and certainly don't publish it to the world. Prefer to obtain it from an environment variable with os.environ["API_KEY"].

comments

The "followed by six exclamation marks" thing is weird enough that it deserves either a unit test which exercises that case, or a # comment or helper """docstring""" that explains the business Use Case. Apparently there's some site that is fond of exclamation marks, but it's unclear how that interferes with the proper functioning of NewsConcerning(). Which brings us back to: write an explicit spec for the function, please, preferably as a docstring. Then some months from now a maintainer can decide whether it is still delivering correct results, even as news content and formatting continues to change.

Oh, and PEP 8 asks that you spell it def news_concerning():

check for error at the site of the error

    response = requests.get(url)
    ...
    except KeyError:
        return "Whoops! ..."

That diagnostic message is very nice and helpful, thank you. But it's very odd to defer the check for so many dozens of lines down in the source code. We should not have to juggle "are we in error state here? / normal state?" while reading so much code that manipulates the abnormal text result.

The usual idiom is

    response = requests.get(url)
    response.raise_for_status()

on the assumption that a 400- or 500-series error won't let us accomplish anything useful with response.text. Or you might examine response.status_code yourself.


Just a high-level comment: a large part of the code consists of rule-based parsing of text (all the string-replace calls). This is extremely brittle (there are lots of cases that you won't cover) and error-prone. Most importantly, it's not maintainable. It's perhaps fine as a first stab, to get an idea of what you want to do or to build a rough prototype, but it's practically impossible to build this out into a bigger program. I would advise spending time learning to use a language model (see the tutorials on the Hugging Face site), and either fine-tuning it or constraining it.

