I want to create a new column in a df that holds one of two possible results of a function.

I have two lists:

lista = ['A', 'B', 'C', 'D']
listb = ['would not', 'use to', 'manage to', 'when', 'did not']

I want to find the first word from lista that appears in the text and return it in a new column called "Effect". If none is found, then search for values from listb and return the first one encountered along with the next 2 words.

Example: (image of the expected output, not available)

I have tried something like this:

def matcher(Description):
    for i in lista:
        if i in Description:
            return i
    return "Not found"

def matcher(Description):
    for j in listb:
        if j in Description:
            return j + 1
    return "Not found"

df["Effect"] = df.apply(lambda i: matcher(i["Description"]), axis=1)
df["Effect"] = df.apply(lambda j: matcher(j["Description"]), axis=1)

2 Answers

The code below should do what you want to achieve:

def matcher(sentence):
    match_list = [substr for substr in lista 
                      if substr in [ word 
                                 for word in sentence.replace(',',' ').split(" ")]]
    if match_list: # list with items evaluates to True, empty list to False
        return match_list[0]
    match_list = [substr for substr in listb if ' '+substr+' ' in sentence]
    if match_list:
        substr = match_list[0]
        return substr + " " + sentence.split(substr)[-1].replace(',',' ').strip().split(" ")[0]
    return "Not found"

df["Effect"] = df.Description.apply(matcher)

If the sentences contain punctuation other than ',', consider replacing all non-letter characters in the sentence with a space using a regular expression instead of .replace(',',' '), so that words are guaranteed to stay separated. Be aware that some unusual combinations of substrings and sentences can have unexpected side effects.
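
The regular-expression replacement suggested above could look like the following minimal sketch. The helper name to_words is hypothetical, and treating every character that is not an ASCII letter as a separator is an assumption that may not suit text containing digits or accented letters.

```python
import re

def to_words(sentence):
    """Split a sentence into words, treating any run of non-letters as a separator."""
    # Replace every run of characters that are not ASCII letters with one space,
    # then split on whitespace; str.split() without arguments drops empty strings.
    return re.sub(r"[^A-Za-z]+", " ", sentence).split()

print(to_words("It was noted that A; was present (and B.)"))
```

Inside matcher, the word list built with sentence.replace(',',' ').split(" ") could then be swapped for to_words(sentence).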

UPDATE: code for adding any number of words after the substring matched from listb (requested in the comments), along with explanations of how the code works:

lista = ['A', 'B', 'C', 'D']
listb = ["unable to", "would not", "was not", "did not", "there is not", "could not", "failed to", "use to", "manage to", "when"]
# ^-- listb extended with phrases from another question on the same subject

# Example from the question: given the following text
sentence1 = "During procedure it was noted that A, was present and were notified to deparment."
#  'A' occurs in the text above, so only the value 'A' is returned for the new column.
sentence2 = "During procedure it was noted that product did not inject as expected."
#  In the text above the goal is to find "did not" and return it along with
# the next N words ("did not inject" for N=1 and "did not inject as" for N=2).

def matcher(sentence, no_words=1):
    # First find a match from lista: 
    match_list = [substr for substr in lista 
                      if substr in [ word 
                                 for word in sentence.replace(',',' ').split(" ")]]
    if match_list: # list with items evaluates to True, empty list to False
        return match_list[0] # if match found in lista exit function with return

    # There was no match from lista so find a match from listb:
    match_list = [substr for substr in listb if ' '+substr+' ' in sentence]
    if match_list:
        substr = match_list[0]
        # The code for returning the substr along with additional words from the sentence
        # splits the sentence on substr 'sentence.split(substr)' and gets the sentence text
        # after the substring by taking the last element of the list created by splitting
        # using the list index [-1] ( [1] gives the same result if substr occurs only once ).
        # .replace(',',' ') handles the case of words separated by ',' instead of ' '. 
        # .strip() handles the case of whitespaces at start and end of the part of 
        # extracted sentence. 
        # .split(" ") creates a list of words after substr in the sentence and the slice 
        # [0:no_words] takes 'no_words' amount of words from this list to join the words
        # to one string using ' '.join() in order to add it to substr:  
        return substr + " " + ' '.join(sentence.split(substr)[-1].replace(',',' ').strip().split(" ")[0:no_words])

    # There was no match from lista or listb (no value has been returned yet), so:
    return "Not found"

print(matcher(sentence1))
print(matcher(sentence2)) # no_words=1 is default
print(matcher(sentence2, 2))

The code above outputs:

A
did not inject
did not inject as
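
Note that matcher returns the first item in list order, not necessarily the item that occurs earliest in the sentence. If the earliest occurrence is what is wanted, a regular-expression alternation is one possible approach. The sketch below is an assumption-laden variant, not the code from the answer above: the function name first_match is hypothetical, it uses the shorter listb from the question, and it returns the matched text exactly as it appears in the sentence (including any punctuation between the words).

```python
import re

lista = ['A', 'B', 'C', 'D']
listb = ["would not", "use to", "manage to", "when", "did not"]

def first_match(sentence, no_words=1):
    """Return the lista item, else the listb phrase, that occurs earliest in the sentence."""
    # \b word boundaries keep 'D' from matching inside 'During'.
    pattern_a = r"\b(?:" + "|".join(map(re.escape, lista)) + r")\b"
    found = re.search(pattern_a, sentence)
    if found:
        return found.group(0)
    # For listb, also take up to no_words words that follow the phrase.
    pattern_b = (r"\b(?:" + "|".join(map(re.escape, listb)) + r")\b"
                 + r"(?:\W+\w+){0,%d}" % no_words)
    found = re.search(pattern_b, sentence)
    if found:
        return found.group(0)
    return "Not found"

print(first_match("During procedure it was noted that product did not inject as expected.", 2))
print(first_match("It failed when it would not start"))
```

For the second sentence this variant returns the phrase starting with "when", because "when" appears before "would not" in the text. A standalone capital "A" used as an article would still be matched as an item of lista.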

7 Comments

I replaced "sentence" with the df column name "Description". This only gives me results from the second part, for example "did not inject". It does not show values from lista.
Try again using the current code in the answer which should work as expected delivering 'A' and 'did not inject' values for the two sentences you mentioned (have tested it and it works).
I have edited my question in order to make it more clear
If I want to print more than 1 next string in the second function, what do I need to modify? substr = match_list[0] return substr + " " + sentence.split(substr)[-1].replace(',',' ').strip().split(" ")[0]
Yes, it didn't work as I have not mentioned the ' '.join() required for it to work. See updated answer with added join() and detailed explanations of how the code achieves what it does.

You can do both at once:

def matcher(Description):
    w = [i for i in lista if i in Description]
    w.extend( [i for i in listb if i in Description] )
    if not w:
        return "Not found"
    else:
        return ' '.join(w)

df["Effect"] = df.apply(lambda i: matcher(i["Description"]), axis=1)

6 Comments

The question asks that if an item of lista is found in Description, that item is returned. So you have to put if not w: directly after w = and run the second comprehension only if w == [], extending the found substring with the one or two words that follow it in Description. It is not clear whether Effect should list all the found items of lista/listb or only the first one found ...
I want the following. For example, given the text "During procedure it was noted that A was present and were notified to deparment.", A exists in the text, so only the value A is returned in the new column. Given the text "During procedure it was noted that product did not inject as expected", I want to find "did not" and return it along with the next 1 word, in this case "did not inject".
What if two or more items from lista or listb are found in the sentence? Return only the first one found? Or all?
Only the first one
What about "A strange result occurred"? How will you tell that's not what you want?