1

Hi I have two DataFrames like below

 DF1

 Alpha   |  Numeric  |  Special

 and     |   1        |  @
 or      |   2       |  #
 lol ok  |   4       |  &






DF2 with single column

Content          

boy or girl  
school @ morn
pyc LoL ok student
Chandra

I want to search if anyone of the column in DF1 has anyone of the keyword in content column of DF2 and the output should be in a new DF

 `df11 = (df1.unstack()
      .reset_index(level=2,drop=True)
      .rename_axis(('col_order','col_name'))
      .dropna()
      .reset_index(name='val_low'))

 df22 = (df2['Content'].str.split(expand=True)
                 .stack()
                 .rename('val')
                 .reset_index(level=1,drop=True)
                 .rename_axis('idx')
                 .reset_index())`

 df22['val_low'] = df22['val'].str.lower()                    

 df = (pd.merge(df22, df11, on='val_low', how='left')
   .dropna(subset=['col_name'])
   .sort_values(['idx','col_order'])
   .drop_duplicates(['idx']))


 df = (pd.concat([df2, df.set_index('idx')], axis=1)
   .fillna({'col_name':'Other'})[['val','col_name','Content']])

but it is not considering the spaces between lol ok

 expected_output_DF

     val      col_name          Content
 0   or       Alpha             boy or girl
 1    @      Special            school @ morn
 2   lol ok  Alpha              pyc LoL ok student
 3  NaN      Other              Chandra

someone help me with this

7
  • Sorry, what's your expected output? Commented Nov 6, 2017 at 6:57
  • output_DF is my expected output Commented Nov 6, 2017 at 6:59
  • if possible pls suggest a new logic Commented Nov 6, 2017 at 7:00
  • See either of our answers, they both do essentially the same thing. Commented Nov 6, 2017 at 7:21
  • yes,sure @bharath, coldspeed (y) Commented Nov 6, 2017 at 7:26

2 Answers 2

2

Melt the dataframe 1 and convert it to dict. Then get the values of dict based on key obtained by pattern match in dataframe 2 i.e

vals = df.melt()
di = dict(zip(vals['value'],vals['variable']))
# {'or': 'Alpha', 1: 'Numeric', 2: 'Numeric', 'and': 'Alpha', 4: 'Numeric', '@': 'Special', '#': 'Special', '&': 'Special', 'Special': 'new', 'Alpha': 'new', 'lol ok': 'Alpha'}

#Create a regex pattern based on dict keys. 
pat = '|'.join(r"\s{}\s".format(x) for x in di.keys())

#Find the words that match the pattern 
df2['val'] = df2['Content'].str.lower().str.findall(pat).apply(lambda x : x[0].strip() if len(x)>=1 else np.nan)

# Map the values with di and fill nan with other. 
df2['new'] = df2['val'].map(di).fillna('other')

Ouptut :

             Content      new     val
0         boy or girl    Alpha      or
1       school @ morn  Special       @
2  pyc LoL ok student    Alpha  lol ok
3             Chandra    other     NaN
Sign up to request clarification or add additional context in comments.

6 Comments

I cannot melt my dataframe, I think the syntax should be pd.melt(df), can you confirm and edit the same in the solution
I am getting this error, TypeError: object of type 'float' has no len() on df2['val'] = df2[content_col].str.lower().str.findall(pat).apply(lambda x : x[0].strip() if len(x)>=1 else np.nan)
Which pandas version are you using?
pandas version is '0.19.2' , I did filledna() but it affects in other ways
Update ur pandas version and try again
|
1

Use str.cat + str.extract. Then, use map for the column names, and pd.concat to join.

i = df.stack().astype(str)
j = i.reset_index(level=0, drop=1)

m = dict(zip(j.values, j.index))
v = i.str.cat(sep='|')  

df2['val'] = df2.Content.str.extract(r'\s(' + v + r')\s', flags=re.I, expand=False)
df2['col_name'] = df2['val'].str.lower().map(m).fillna('Other')

df2

              Content     val col_name
0         boy or girl      or    Alpha
1       school @ morn       @  Special
2  pyc LoL ok student  LoL ok    Alpha
3             Chandra     NaN    Other

Details

  • i and j are setup variable to create the mapping
  • m is the mapping of values to column names
  • v is the regex pattern that is sent to str.extract for keyword extraction. I use re.I to ignore the case
  • I use map + fillna to map extracted values to column names via m
v
'and|1|@|or|2|#|lol ok|4|&'

m
{'#': 'Special',
 '&': 'Special',
 '1': 'Numeric',
 '2': 'Numeric',
 '4': 'Numeric',
 '@': 'Special',
 'and': 'Alpha',
 'lol ok': 'Alpha',
 'or': 'Alpha'}


df['val']

0        or
1         @
2    LoL ok
3       NaN
Name: val, dtype: object

df['col_name']

0      Alpha
1    Special
2      Alpha
3      Other
Name: col_name, dtype: object

6 Comments

@CHANDRAMURUGAN Check the top of my answer, it's there.
@cᴏʟᴅsᴘᴇᴇᴅ so many new terms in the code. You need to update it. I think its df2['val'] = and also not jus val.str maybe df2['val'l.str.
@Bharath Oh man, there were a ton of those bugs, but I've fixed them.
Cheers, me too :)
@coldspped, tried your solution its is not working when the mapping keyword present in the end like if the keyword in df is "displayed" and in df2["content"]="Policies not getting renewed displayed" its going under other category
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.