1

My question is about identifying variants of a string in Python. Let me explain what I am trying to do:

A string such as tom and jerry could also be written, in lowercase, as:

  1. tom n jerry
  2. tom_jerry
  3. tom & jerry
  4. tom and jerry

and so on. As this minimal example shows, there are already 4 possible variants, and even if I created a dictionary containing all 4, I would still miss a string containing tom _ jerry. What can I do to recognize tom and jerry? Creating many rules seems very inefficient. Is there a more efficient way to do this?

7
  • 2
    Recognizing all possibilities will require artificial intelligence. Maybe an NLP library can do this. Commented Aug 26, 2022 at 20:07
  • 3
    It really depends on your acceptance criteria. For this specific example you could do s.startswith("tom") and s.endswith("jerry") to test a given string s and it'd return true for all of the examples. But it would also return true for really huge strings that you might not want to accept, and it would return false on minor misspellings of either tom or jerry, which you also might not want. Commented Aug 26, 2022 at 20:08
  • 3
    A better approach might be to compute the Levenshtein distance (which is relatively straightforward) and decide on a particular threshold that is "close enough" for your purposes. Commented Aug 26, 2022 at 20:09
  • 1
    There are fundamentally two ways to do this: (1) Use NLP. This might be a bit over-the-top depending on your use case. (2) Create a set of rules, such as s.startswith("tom") and s.endswith("jerry") and len(s) < 15 Commented Aug 26, 2022 at 20:13
  • @Lecdi I don't think I can do that, because the string is part of a sentence. Commented Aug 26, 2022 at 20:15
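
As a rough illustration of the Levenshtein suggestion in the comments above, here is a minimal sketch in plain Python. The function names `levenshtein` and `is_close` and the `max_distance` threshold of 5 are assumptions made for illustration, not something from the thread; the threshold in particular would need tuning against real data.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def is_close(candidate: str, target: str = "tom and jerry", max_distance: int = 5) -> bool:
    # max_distance is an assumed threshold; tune it for your data
    return levenshtein(candidate.lower(), target) <= max_distance
```

With this threshold, all four variants from the question are accepted (tom_jerry is the farthest, at distance 5). Note that it compares whole strings, so a sentence would first need to be split into candidate phrases.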

2 Answers

2

This will extract any of those combinations from a sentence, provided tom appears in it and jerry appears somewhere after it:

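combo = "tom n jerry"
string = "This is an episode of " + combo + " that deals with something."
start = string.find("tom")
end = string.find("jerry", start)  # search for "jerry" only after "tom"
if start != -1 and end != -1:  # find() returns -1 when the text is absent
    substring = string[start:end + len("jerry")]
    print(substring)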
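
If the set of separators is known in advance, the same idea can be written as a single regular expression. This is only a sketch under that assumption: the pattern name `TOM_AND_JERRY`, the helper `find_tom_and_jerry`, and the list of accepted separators (and, n, &, underscores, whitespace) are all hypothetical and would need extending for other data.

```python
import re
from typing import List

# Assumed separators: "and", "n", "&", underscores, or plain whitespace.
# Extend the alternation if your data uses others.
TOM_AND_JERRY = re.compile(r"\btom[\s_]*(?:and|n|&)?[\s_]*jerry\b", re.IGNORECASE)


def find_tom_and_jerry(sentence: str) -> List[str]:
    """Return every variant of the phrase found in the sentence."""
    # The pattern has no capturing groups, so findall returns whole matches
    return TOM_AND_JERRY.findall(sentence)
```

Unlike slicing between two find() calls, this does not accept arbitrary text between tom and jerry, but it also will not tolerate misspellings of either name.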

1

You could attempt this using a sequence matcher.

from difflib import SequenceMatcher

def checkMatch(firstWord: str, secondWord: str, strictness: float) -> bool:
    # ratio() returns a similarity score between 0.0 (nothing in common) and 1.0 (identical)
    ratio = SequenceMatcher(None, firstWord.strip(), secondWord.strip()).ratio()
    return ratio > strictness

if __name__ == "__main__":
    originalWord = "tom and jerry"
    toMatch = "tom_jerry" # chose this one as it is the least likely in your example
    toMatch = toMatch.lower()  # str.lower() returns a new string, so assign it; compare both strings in the same case
    strictness = 0.6  # a ratio above 0.6 means the strings are generally pretty similar
    print(checkMatch(originalWord, toMatch, strictness))

You can learn more about how sequence matcher works here: https://towardsdatascience.com/sequencematcher-in-python-6b1e6f3915fc
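
Since the phrase is part of a sentence, one way to apply the same matcher is to slide a window of words over the sentence and keep the best-scoring window. The sketch below assumes whitespace tokenization and a window of at most three words; `best_match` and `max_words` are hypothetical names chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Tuple


def best_match(sentence: str, target: str = "tom and jerry", max_words: int = 3) -> Tuple[float, str]:
    """Return (ratio, phrase) for the word window most similar to target."""
    words = sentence.lower().split()
    best = (0.0, "")
    # Try every window of 1..max_words consecutive words
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            phrase = " ".join(words[start:start + size])
            ratio = SequenceMatcher(None, phrase, target).ratio()
            if ratio > best[0]:
                best = (ratio, phrase)
    return best


if __name__ == "__main__":
    ratio, phrase = best_match("This is an episode of tom_jerry that deals with something.")
    print(phrase, ratio)
```

The returned ratio can then be compared against a strictness threshold, as in checkMatch. Punctuation attached to a word is kept by split(), so it may be worth stripping it first.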
