0

I have two dataframes as follows,

import pandas as pd
df = pd.DataFrame({'text':['I go to school','open the green door', 'go out and play'],
               'pos':[['PRON','VERB','ADP','NOUN'],['VERB','DET','ADJ','NOUN'],['VERB','ADP','CCONJ','VERB']]})

df2 = pd.DataFrame({'verbs':['go','open','close','share','divide'],
                   'new_verbs':['went','opened','closed','shared','divided']})

I would like to replace the verbs in df.text with their past form in df2.new_verbs if the verbs are found in df2.verbs. and so far I have done the following,

df['text'] = df['text'].str.split()
new_df = df.apply(pd.Series.explode)
new_df = new_df.assign(new=lambda d: d['pos'].mask(d['pos'] == 'VERB', d['text']))
new_df.text[new_df.new.isin(df2.verbs)] = df2.new_verbs

but when I print out the result, not all verbs are correctly replaced. My desired output would be,

       text    pos    new
0       I   PRON   PRON
0    went   VERB     go
0      to    ADP    ADP
0  school   NOUN   NOUN
1  opened   VERB   open
1     the    DET    DET
1   green    ADJ    ADJ
1    door   NOUN   NOUN
2    went   VERB     go
2     out    ADP    ADP
2     and  CCONJ  CCONJ
2    play   VERB   play

2 Answers 2

3

You can use a regex for that:

import re
regex = '|'.join(map(re.escape, df2['verbs']))
s = df2.set_index('verbs')['new_verbs']

df['text'] = df['text'].str.replace(regex, lambda m: s.get(m.group(), m),
                                    regex=True)

output (here as column text2 for clarity):

                  text                       pos                  text2
0       I go to school   [PRON, VERB, ADP, NOUN]       I went to school
1  open the green door    [VERB, DET, ADJ, NOUN]  opened the green door
2      go out and play  [VERB, ADP, CCONJ, VERB]      went out and play
Sign up to request clarification or add additional context in comments.

Comments

1

For smaller lists, you can use pandas replace and a dictionary like this:

verbs_map = dict(zip(df2.verbs, df2.new_verbs))
new_df.text.replace(verbs_map)

Basically, dict(zip(df2.verbs, df2.new_verbs) creates a new dictionary that maps old verbs to their new (past tense) verbs, e.g. {'go' : 'went' , 'close' : 'closed', ...}.

4 Comments

This solution will become very slow if there are many verbs to replace. I think replace with a dictionary runs as many times as there are items.
Yes, ideally the dictionary should be created once at the start but apart from that the lookups are fast, I'll change the code to reflect that.
no, I meant due to how replace works, it checks items one after the other
@mozway yeah fair enough, I'll update my answer to say that.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.