pythonic way to find column values of a dataframe in a given string

Question

I have a pandas dataframe like this:

import pandas as pd

data={
    'col1':['New Zealand', 'Gym', 'United States'],
    'col2':['Republic of South Africa', 'Park', 'United States of America'],
}
df=pd.DataFrame(data)
print(df)

            col1                      col2
0    New Zealand  Republic of South Africa
1            Gym                      Park
2  United States  United States of America

I also have a sentence that might contain values from any of the columns of the dataframe. I want to find which column values occur in the given sentence, and which column each one belongs to. I have seen some similar solutions, but they check whether the sentence occurs in the column values, not the other way around. Currently, I am doing it like this:

def find_match(df,sentence):
    "returns true/false depending on the matching value and column name where the value exists"
    arr=[]
    cols=[]
    flag=False
    for i,row in df.iterrows():
        if row['col1'].lower() in sentence.lower():
            arr.append(row['col1'])
            cols.append('col1')
            flag=True
        elif row['col2'].lower() in sentence.lower():
            arr.append(row['col2'])
            cols.append('col2')
            flag=True
    return flag,arr,cols

sentence="I live in the United States"
find_match(df,sentence)  # returns (True, ['United States'], ['col1'])

I want a more pythonic way to do this, because the loop above takes a long time on a fairly large dataframe.

I cannot use .isin() because it expects a list of strings and compares each column value with the whole sentence. I have also tried the following, but it throws an error:

df.loc[df['col1'].str.lower() in sentence]  # raises an error: the left operand of `in` must be a string, not a Series

Any help will be highly appreciated. Thanks!

SimonT · Accepted Answer · 2020-07-24 06:27:01Z

1

I would do something like this:

def find_match(df,sentence):
    # (row position, column name) of every cell whose lowercased value occurs in the sentence
    ids = [(i,j) for j in df.columns for i,v in enumerate(df[j]) if v.lower() in sentence.lower()]
    return len(ids)>0, [df[j][i] for i,j in ids], [j for _,j in ids]

Which gives:

find_match(df, sentence = 'I regularly go to the gym in the United States of America')

(True,
 ['Gym', 'United States', 'United States of America'],
 ['col1', 'col1', 'col2'])

In my view this is fairly pythonic, although there may be more elegant ways that make fuller use of pandas functions.
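One possible variant that leans more on pandas, offered as a sketch rather than a definitive answer: reshape the frame to long form with `melt`, lowercase the values with `.str.lower()`, and test each one against the lowercased sentence. The function name `find_match_melted` and the `column`/`value` labels are hypothetical, and the sketch assumes every cell is a non-null string.

```python
import pandas as pd


def find_match_melted(df, sentence):
    """Return (found, matched_values, column_names).

    Hypothetical helper; assumes every cell is a non-null string.
    """
    target = sentence.lower()  # lowercase the sentence once
    # Long form: one row per cell, with the source column name alongside the value.
    long_df = df.melt(var_name='column', value_name='value')
    # True where the lowercased cell value occurs as a substring of the sentence.
    mask = long_df['value'].str.lower().map(lambda value: value in target)
    hits = long_df[mask]
    return bool(mask.any()), hits['value'].tolist(), hits['column'].tolist()


df = pd.DataFrame({
    'col1': ['New Zealand', 'Gym', 'United States'],
    'col2': ['Republic of South Africa', 'Park', 'United States of America'],
})
result = find_match_melted(df, 'I regularly go to the gym in the United States of America')
print(result)
```

The substring test itself is still a Python-level check per cell, so whether this is any faster on a large frame would need to be measured.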

edited Jul 24, 2020 at 6:27

answered Jul 24, 2020 at 6:12

SimonT


3 Comments

Asim Over a year ago

great, but it is not giving the exact string that is matched, right?

SimonT Over a year ago

So you want your solution to contain only one value for each row? That is, a sentence containing both gym and park should only return the position for gym?

SimonT Over a year ago

I just edited the answer such that strings instead of row ids are printed

jsmart · Accepted Answer · 2020-07-24 14:18:39Z

0

Evidently you would like to check whether each value in col1 is a substring of the sentence. Is this correct? If so, here is one way:

import pandas as pd

df = pd.DataFrame(
    {'col1': ['New Zealand', 'Gym', 'United States'],
    'col2': ['Republic of South Africa', 'Park', 'United States of America']})

sentence = 'I live in the United States'

mask = df['col1'].apply(lambda x: x in sentence) # `mask` is a boolean Series

if mask.any():
    matches = df.loc[mask, 'col1']  # values of col1 that occur in the sentence
    print(mask.any(), end=', ')
    print(matches.values, end=', ')
    print('col1')
    print()

# the print statements produce the following line
# True, ['United States'], col1

If this is the right logic for one column, then you could put the mask statement and the if clause inside a loop such as for col in df.columns, using col in place of 'col1'.

Update: we can modify the lambda expression to perform case-insensitive comparison. (The original data frame is not changed.)

mask = df['col1'].apply(lambda x: x.lower() in sentence.lower())
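Combining the loop over columns with the case-insensitive mask might look like the following sketch. The `results` dictionary and the `sentence_lower` variable are illustrative names that do not appear in the answer above, and every cell is assumed to be a non-null string.

```python
import pandas as pd

df = pd.DataFrame(
    {'col1': ['New Zealand', 'Gym', 'United States'],
     'col2': ['Republic of South Africa', 'Park', 'United States of America']})

sentence = 'I live in the United States'
sentence_lower = sentence.lower()  # lowercase once, outside the loop

results = {}  # illustrative container: column name -> list of matched values
for col in df.columns:
    # Case-insensitive substring test for every value in this column.
    mask = df[col].apply(lambda x: x.lower() in sentence_lower)
    if mask.any():
        results[col] = df.loc[mask, col].tolist()

print(results)
```

Columns with no match are simply left out of `results`, so an empty dictionary means nothing was found.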

edited Jul 24, 2020 at 14:18

answered Jul 24, 2020 at 6:16

jsmart


2 Comments

Asim Over a year ago

it is not returning the matches if the value is in lower-case. Can you kindly tell me how to fix that? Apparently x.lower() is not working while masking.

jsmart Over a year ago

I added .lower() to the lambda expression, which worked for my example.
