I have two dataframes, one with text information and another with regexes and patterns. What I need to do is apply each regex to all of the df['text'] rows and, whenever there is a match, write the corresponding Pattern into a new column.
Sample data
text_dict = {'text':['customer and increased repair and remodel activity as well as from other sales',
'sales for the overseas customers',
'marketing approach is driving strong play from top tier customers',
'employees in India have been the continuance of remote work will impact productivity',
'sales due to higher customer']}
regex_dict = {'Pattern':['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
'regex': ['(?:sales\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:sales\\w*)',
'(?:marketing\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:marketing\\w*)',
'(?:employee\\w*)(?:[^\n])*(?:productivity\\w*)|(?:productivity\\w*)(?:[^\n])*(?:employee\\w*)']}
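For reproducibility, the two dataframes are built from these dicts (assuming pandas is imported as pd):

import pandas as pd

df = pd.DataFrame(text_dict)
regex = pd.DataFrame(regex_dict)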
df
text
0 customer and increased repair and remodel acti...
1 sales for the overseas customers
2 marketing approach is driving strong play from...
3 employees in India have been the continuance o...
4 sales due to higher customer
regex
Pattern regex
0 Sales + customer (?:sales\w*)(?:[^,.?])*(?:customer\w*)|(?:cust...
1 Marketing + customer (?:marketing\w*)(?:[^,.?])*(?:customer\w*)|(?:...
2 Employee * Productivity (?:employee\w*)(?:[^\n])*(?:productivity\w*)|(...
Desired output
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer
I tried the following: I created a function that returns the Pattern when there is a match, and then I iterate over all the rows of the regex dataframe:
import re

def finding_keywords(regex, match, keyword):
    # return the pattern name if the regex matches the text, otherwise None
    if re.search(regex, match):
        return keyword
    else:
        pass

for index, row in regex.iterrows():
    df['Pattern'] = df['text'].apply(lambda x: finding_keywords(regex['regex'][index], x, regex['Pattern'][index]))
The problem with this is that every iteration erases the previous mappings, as you can see below. Since I'm foo foo was processed in the last iteration, it is the only row left with a pattern:
text Pattern
0 foo None
1 bar None
2 foo foo I'm foo foo
3 foo bar None
4 bar bar None
One solution could be to iterate over the regex dataframe and, inside that loop, iterate over df; this way I avoid losing information, but I'm looking for a faster solution.
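For clarity, this is roughly what that nested-loop idea would look like (a minimal sketch, assuming the columns are named 'regex' and 'Pattern' as in the sample data). It only fills rows that are still empty, so earlier matches are preserved, but it is still a row-by-row Python loop, which is what I would like to avoid:

import re

df['Pattern'] = None
for _, rrow in regex.iterrows():
    compiled = re.compile(rrow['regex'])
    for i, text in df['text'].items():
        # only fill rows that no earlier regex has matched yet
        if df.at[i, 'Pattern'] is None and compiled.search(text):
            df.at[i, 'Pattern'] = rrow['Pattern']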