Using regex to recognize patterns and clean data in pandas columns

Question

I have a dataset that contains innovation data of companies, by using some regex, I want to retrieve the licence data

company licences/patents
1       UX226, licence-pp-zz, licence-zz-pp, licence-xx-tt
2       VV3346E, SS345
3       licence-dd-zz
4       UT223, licence, ss
5       XBTYU, licence-tt-kk, licence-ss-tt
6       xc, zz
7       licence-xb-xz

Desired output:

company licences/patents                                    licence
1       UX226, licence-pp-zz, licence-zz-pp, licence-xx-tt  licence-pp-zz, licence-zz-pp, licence-xx-tt 
2       VV3346E, SS345
3       licence-dd-zz                                       licence-dd-zz
4       UT223, licence, ss
5       XBTYU, licence-tt-kk, licence-ss-tt                 licence-tt-kk, licence-ss-tt
6       xc, zz
7       licence-xb-xz                                       licence-xb-xz

Scott Boston · Accepted Answer · 2019-04-21 02:52:57Z

You can try:

df['licence'] = df['licences/patents'].str.extractall('(licence-\w{2}-\w{2})')\
  .unstack().apply(lambda x: ', '.join(x.dropna()), axis=1)

Output:

   company                                   licences/patents                                      licence
0        1  UX226, licence-pp-zz, licence-zz-pp, licence-x...  licence-pp-zz, licence-zz-pp, licence-xx-tt
1        2                                     VV3346E, SS345                                          NaN
2        3                                      licence-dd-zz                                licence-dd-zz
3        4                                 UT223, licence, ss                                          NaN
4        5                XBTYU, licence-tt-kk, licence-ss-tt                 licence-tt-kk, licence-ss-tt
5        6                                             xc, zz                                          NaN
6        7                                      licence-xb-xz                                licence-xb-xz

Chris Adams · Accepted Answer · 2019-04-21 10:09:26Z

Another approach, using Series.str.findall and Series.str.join:

df['licence'] = df['licences/patents'].str.findall(r'(licence[^,]*)').str.join(', ')

[out]

   company                                   licences/patents  \
0        1  UX226, licence-pp-zz, licence-zz-pp, licence-x...   
1        2                                     VV3346E, SS345   
2        3                                      licence-dd-zz   
3        4                                 UT223, licence, ss   
4        5                XBTYU, licence-tt-kk, licence-ss-tt   
5        6                                             xc, zz   
6        7                                      licence-xb-xz   

                                       licence  
0  licence-pp-zz, licence-zz-pp, licence-xx-tt  
1                                               
2                                licence-dd-zz  
3                                      licence  
4                 licence-tt-kk, licence-ss-tt  
5                                               
6                                licence-xb-xz

Emma Marcier · Accepted Answer · 2019-04-21 02:50:16Z

1

This RegEx might help you to create one group, $1, where your desired output may be:

(licence-[a-z]{2}-[a-z]{2})

answered Apr 21, 2019 at 2:50

Emma Marcier

27.8k12 gold badges49 silver badges71 bronze badges

Comments

r_hudson · Accepted Answer · 2019-04-21 14:10:39Z

1

df['license'] = df['licences/patents'].apply(lambda x: ''.join(re.findall('lice.*',x)))

It will create new column license with licenses stripped from original column.

edited Apr 21, 2019 at 14:10

answered Apr 21, 2019 at 5:20

r_hudson

1938 bronze badges

Comments

Rajat Jain · Accepted Answer · 2019-04-21 17:25:01Z

1

Try below code:

df['licences/patents'].str.findall('(licence-[\w\-]+)').apply(", ".join)

answered Apr 21, 2019 at 17:25

Rajat Jain

2,0422 gold badges17 silver badges23 bronze badges

Collectives™ on Stack Overflow

Using regex to recognize patterns and clean data in pandas columns

5 Answers 5

Comments

Comments

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

Comments

Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related