
I have a column in a DataFrame that contains the following values:

>>> import pandas as pd
>>> df = pd.DataFrame({'Sentence':['his is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm', 'I have researched the product KEY_abc_def, and KEY_blt_chm as requested', 'He got the idea from your message KEY_mno_pqr']})
>>> df
                                                Sentence
0       This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm
1  I have researched the product KEY_abc_def, and KEY_blt_chm as requested
2            He got the idea from your message KEY_mno_pqr

I would like to use regex to extract each KEY into a new column, without the literal "KEY_" prefix. For sentences that have more than one KEY, the keys should be joined with a comma. The output should look like this:

>>> df
                                                Sentence                               KEY
0      This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm    abc_def, mno_pqr, blt_chm
1  I have researched the product KEY_abc_def, and KEY_blt_chm as requested          abc_def, blt_chm     
2           He got the idea from your message KEY_mno_pqr                           mno_pqr  

I tried the code below, but it only extracts the first KEY in each sentence and ignores the rest. I'm new to regex, so any suggestions would be greatly appreciated.

df['KEY'] = df['Sentence'].str.extract(r"KEY_(\w+)", expand=True)
  • I bet ANTLR is a better approach for whatever you want to do Commented Nov 26, 2020 at 10:18

1 Answer


Use

df['KEY'] = df['Sentence'].str.findall(r"KEY_(\w+)").str.join(",")

Series.str.findall returns, for each row, a list of every substring matched by the capture group, and str.join(",") joins each list into a single comma-separated string.
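To make the difference from str.extract visible, here is a minimal sketch on a small made-up Series (the sample strings are hypothetical, not the data from the question). It shows that extract keeps only the first match per row, while findall collects all of them before they are joined:

```python
import pandas as pd

# Hypothetical sample data: one row with two keys, one row with none
s = pd.Series(["x KEY_abc_def y KEY_mno_pqr", "no keys here"])

# str.extract returns only the first match of the capture group per row
first_only = s.str.extract(r"KEY_(\w+)", expand=False)

# str.findall returns a list of all capture-group matches per row
all_matches = s.str.findall(r"KEY_(\w+)")

# str.join concatenates each row's list with the given separator
joined = all_matches.str.join(",")

print(first_only.tolist())
print(all_matches.tolist())
print(joined.tolist())
```

A row without any key yields an empty list from findall, which joins to an empty string, whereas extract gives a missing value for that row.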

Pandas test:

>>> df['KEY'] = df['Sentence'].str.findall(r"KEY_(\w+)").str.join(",")
>>> df
                                                                   Sentence                      KEY
0  his is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm  abc_def,mno_pqr,blt_chm
1   I have researched the product KEY_abc_def, and KEY_blt_chm as requested          abc_def,blt_chm
2                             He got the idea from your message KEY_mno_pqr                  mno_pqr

(Note: I used pd.set_option('display.max_colwidth', None) to display the full contents of each column; see How to display full (non-truncated) dataframe information in html when converting from pandas dataframe to html?).
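The expected output in the question separates the keys with a comma followed by a space. Assuming that spacing matters, a small variation of the same approach passes ", " as the separator. This is only a sketch that rebuilds the question's DataFrame so it can be run on its own:

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({'Sentence': [
    'his is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm',
    'I have researched the product KEY_abc_def, and KEY_blt_chm as requested',
    'He got the idea from your message KEY_mno_pqr',
]})

# Raw string so that \w reaches the regex engine unchanged;
# ", " as the separator to match the spacing in the requested output
df['KEY'] = df['Sentence'].str.findall(r"KEY_(\w+)").str.join(", ")

print(df['KEY'].tolist())
```

Because \w matches only letters, digits and underscores, the comma after KEY_abc_def in the second sentence is not captured.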
