2

I have two df with two string columns as below:

Df1: Original df has 2000 rows of Names

Id    Name
1     Paper
2     Paper
3     Scissors
4     Mat
5     Cat
6     Cat

2nd Df: Original df has 1000+ Item_Name

Item_ID   Item_Name
1         Paper Bag
2         wallpaper
3         paper
4         cat cage

I need strings in column Name that are in column Item_Name

1st Approach: By using str.contains:

I know how to match string if their was one column and a couple of strings to be matched like below:

df[df['Name'].str.contains("paper|cat", na=False)]

But how to do when there are two columns of string (name & Item_name) to be matched?

2nd Approach: Fuzzywuzzy

matched = []
for row in df1.index:
    name = df1.get_value(row,"Name")
    for columns in df2.index:
        item_name=df2.get_value(columns,"Item_Name")
        matched_token=fuzz.token_sort_ratio(name,item_name)
        if matched_token> 80:
            matched.append([name,item_name,matched_token])

The problem will be, it will be slow and my intended output what I want is litle more which I get from fuzzywuzzy. The output looks like:

Id Name     Item_ID
1  Paper     1,2,3
2  Paper     1,2,3
3  Scissors  NA 
4  Mat       NA 
5  Cat       4
6  Cat       4 

Summarize:

  1. How to do str.contains if their are two df with different column name
  2. How to transform the df to get the above intended output

3 Answers 3

3

You can use pd.Series.apply with a custom function:

def matcher(x):
    res = df2.loc[df2['Item_Name'].str.contains(x, regex=False, case=False), 'Item_ID']
    return ','.join(res.astype(str))

df1['Item_ID'] = df1['Name'].apply(matcher)

print(df1)

   Id      Name Item_ID
0   1     Paper   1,2,3
1   2     Paper   1,2,3
2   3  Scissors        
3   4       Mat        
4   5       Cat       4
5   6       Cat       4

There are ways you can make this more efficient:

  • Only work on unique items in df1['Name']: apply is a row-wise loop.
  • Use list comprehensions instead of pd.Series.apply. Both are Python-level loops, but list comprehensions often outperform versus Pandas str methods.

But the above don't improve algorithmic complexity. For orders of magnitude better improvement, you should consider a trie-based algorithm, such as this answer utilizing the Aho–Corasick algorithm.

Sign up to request clarification or add additional context in comments.

11 Comments

your item_ID column is String
@min2bro, Yes, so is Item_ID in OP's desired output.
@jpp: I will try this and get back to you
Since you mentioned that it works only on unique. No, that's not what my post says. It says if you want to improve performance, remove duplicate values first. I can't replicate your problem with matching with different Item_ID.
@RahulAgarwal, Unfortunately, I think it's beyond a tweak, best to ask a new question (with a complete example so people understand what you mean!).
|
1

You can use df.apply here

def get_vals(df):
    return ','.join(map(str, df2.loc[df2['Item_Name'].str.contains(df['Name'], case=False),]['Item_ID'].tolist()))

df1['Item_ID'] = df1.apply(get_vals, axis=1)

Output:

Id     Name  Item_ID
1     Paper   1,2,3
2     Paper   1,2,3
3  Scissors        
4       Mat        
5       Cat       4
6       Cat       4

Believe this will give you your desired result

1 Comment

Getting an error: TypeError: ("'Series' objects are mutable, thus they cannot be hashed", u'occurred at index 0')
0
df=pd.DataFrame({'ID':[1,2,3,4,5,6],'Name':['paper','paper','scissors','mat','cat','cat']})
df1=pd.DataFrame({'ID':[1,2,3,4],'Name':['paper bag','wallpaper','paper','cat cage']})


import numpy as np
def lookup_prod(ip):
    lst=[]
    for idx,row in df1.iterrows():
        if ip in row['Name']:
            lst.append(row['ID'])    
    if not lst:
        return np.NaN
    return lst

df['Item_ID'] = df['Name'].apply(lookup_prod)

Output:

 ID  Name         Item_ID
0   1   paper     [1, 2, 3]
1   2   paper     [1, 2, 3]
2   3   scissors    NaN
3   4   mat         NaN
4   5   cat         [4]
5   6   cat         [4]

1 Comment

It is giving me all rows as Nan. I your example you have taken column names of both df as same. Can you change the same, so I can understand which ID and which Name you are referring in your code

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.