I have two dataframes: one with text, and another with regexes and their pattern names. I need to map the Pattern column from the second dataframe onto the first by matching the regexes against the text.

Edit: I need to apply each regex to every row of df['text'] and, where there is a match, write the corresponding Pattern into a new column.

Sample data

import pandas as pd

text_dict = {'text':['customer and increased repair and remodel activity as well as from other sales',
             'sales for the overseas customers',
             'marketing approach is driving strong play from top tier customers',
             'employees in India have been the continuance of remote work will impact productivity',
             'sales due to higher customer']}

regex_dict = {'Pattern':['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
             'regex': ['(?:sales\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:sales\\w*)',
                       '(?:marketing\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:marketing\\w*)',
                       '(?:employee\\w*)(?:[^\n])*(?:productivity\\w*)|(?:productivity\\w*)(?:[^\n])*(?:employee\\w*)']}

# Build the two dataframes shown below
df = pd.DataFrame(text_dict)
regex = pd.DataFrame(regex_dict)

df

                                                text
0  customer and increased repair and remodel acti...
1                   sales for the overseas customers
2  marketing approach is driving strong play from...
3  employees in India have been the continuance o...
4                       sales due to higher customer

regex

                   Pattern                                              regex
0         Sales + customer  (?:sales\w*)(?:[^,.?])*(?:customer\w*)|(?:cust...
1     Marketing + customer  (?:marketing\w*)(?:[^,.?])*(?:customer\w*)|(?:...
2  Employee * Productivity  (?:employee\w*)(?:[^\n])*(?:productivity\w*)|(...

Desired output

                                                text                  Pattern
0  customer and increased repair and remodel acti...         Sales + customer
1                   sales for the overseas customers         Sales + customer
2  marketing approach is driving strong play from...     Marketing + customer
3  employees in India have been the continuance o...  Employee * Productivity
4                       sales due to higher customer         Sales + customer

I tried the following: I wrote a function that returns the Pattern when there is a match, and then I iterate over all the rows of the regex dataframe.

import re

def finding_keywords(regex, match, keyword):
    # Return the keyword if the regex matches the text, otherwise None
    if re.search(regex, match):
        return keyword
    return None

for index, row in regex.iterrows():
    # Each pass reassigns the whole 'Pattern' column
    df['Pattern'] = df['text'].apply(lambda x: finding_keywords(row['regex'], x, row['Pattern']))

The problem is that every iteration overwrites the previous mappings, as you can see below. Because "I'm foo foo" was the pattern in the last iteration, it is the only one that remains.

      text      Pattern
0      foo         None
1      bar         None
2  foo foo  I'm foo foo
3  foo bar         None
4  bar bar         None

One solution would be to loop over the regex dataframe and, inside that loop, iterate over df, so that I avoid losing information. However, I'm looking for a faster solution.

  • It is not clear what you are trying to do, so kindly offer a clear explanation of the problem. Also, please share the code you have tried. Commented Oct 27, 2020 at 21:54
  • I updated the description with more clarity on what I'm trying to achieve Commented Oct 27, 2020 at 22:07

1 Answer

You can loop through the unique values of the regex column in the regex dataframe, test each one against the text column of df, and store the regex that matched in a new regex column. Then merge in the Pattern column on that regex column and drop the regex column.

The key to my approach is to create the column as NaN first and then call fillna on each iteration, so that matches from earlier iterations are not overwritten.

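import re
import numpy as np
import pandas as pd

srs = regex['regex'].unique()
df['regex'] = np.nan  # start empty so fillna only touches rows without a match

for reg in srs:
    # fillna keeps earlier matches and fills only the rows that are still NaN
    df['regex'] = df['regex'].fillna(df['text'].apply(lambda x: reg
                               if re.search(reg, x) else np.nan))

# Look up the Pattern for each matched regex, then drop the helper column
df = pd.merge(df, regex, how='left', on='regex').drop('regex', axis=1)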

df

Out[1]: 
                                                text                  Pattern
0  customer and increased repair and remodel acti...         Sales + customer
1                   sales for the overseas customers         Sales + customer
2  marketing approach is driving strong play from...     Marketing + customer
3  employees in India have been the continuance o...  Employee * Productivity
4                       sales due to higher customer         Sales + customer
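
As a variation on the same first-match-wins idea, the merge step can probably be skipped by writing the Pattern label directly with a boolean mask. This is only a sketch: `map_first_pattern` is a hypothetical helper name, the third text row is made up to show the no-match case, and `Series.str.contains` stands in for the `apply` plus `re.search` used above.

import pandas as pd

df = pd.DataFrame({'text': [
    'sales for the overseas customers',
    'marketing approach is driving strong play from top tier customers',
    'nothing relevant here',  # made-up row with no match
]})
regex = pd.DataFrame({
    'Pattern': ['Sales + customer', 'Marketing + customer'],
    'regex': [
        r'(?:sales\w*)(?:[^,.?])*(?:customer\w*)|(?:customer\w*)(?:[^,.?])*(?:sales\w*)',
        r'(?:marketing\w*)(?:[^,.?])*(?:customer\w*)|(?:customer\w*)(?:[^,.?])*(?:marketing\w*)',
    ],
})

def map_first_pattern(texts, patterns):
    """Return, for each text, the Pattern of the first regex that matches."""
    result = pd.Series([None] * len(texts), index=texts.index, dtype=object)
    for label, rx in zip(patterns['Pattern'], patterns['regex']):
        # Vectorised regex search over the whole column
        hits = texts.str.contains(rx, regex=True, na=False)
        # Only fill rows that have no label yet, so earlier matches survive
        result.loc[hits & result.isna()] = label
    return result

df['Pattern'] = map_first_pattern(df['text'], regex)
print(df)

Rows that match no regex keep a missing value, as they do after the left merge in the approach above. Whether this is actually faster depends on the data, so it is worth timing both versions.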