
I have a dataset containing innovation data for companies. Using a regex, I want to extract the licence entries into a new column:

company licences/patents
1       UX226, licence-pp-zz, licence-zz-pp, licence-xx-tt
2       VV3346E, SS345
3       licence-dd-zz
4       UT223, licence, ss
5       XBTYU, licence-tt-kk, licence-ss-tt
6       xc, zz
7       licence-xb-xz

Desired output:

company licences/patents                                    licence
1       UX226, licence-pp-zz, licence-zz-pp, licence-xx-tt  licence-pp-zz, licence-zz-pp, licence-xx-tt 
2       VV3346E, SS345
3       licence-dd-zz                                       licence-dd-zz
4       UT223, licence, ss
5       XBTYU, licence-tt-kk, licence-ss-tt                 licence-tt-kk, licence-ss-tt
6       xc, zz
7       licence-xb-xz                                       licence-xb-xz
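
For reproducibility, the sample above could be built as a DataFrame roughly like this. This is only a sketch: the variable name df and the integer type of the company column are assumptions, chosen to match the answers.

```python
import pandas as pd

# Sample data reconstructed from the table in the question.
# The name `df` is an assumption; the column names follow the table header.
df = pd.DataFrame({
    "company": [1, 2, 3, 4, 5, 6, 7],
    "licences/patents": [
        "UX226, licence-pp-zz, licence-zz-pp, licence-xx-tt",
        "VV3346E, SS345",
        "licence-dd-zz",
        "UT223, licence, ss",
        "XBTYU, licence-tt-kk, licence-ss-tt",
        "xc, zz",
        "licence-xb-xz",
    ],
})
print(df)
```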

5 Answers

2

You can try:

df['licence'] = df['licences/patents'].str.extractall(r'(licence-\w{2}-\w{2})')\
  .unstack().apply(lambda x: ', '.join(x.dropna()), axis=1)

Output:

   company                                   licences/patents                                      licence
0        1  UX226, licence-pp-zz, licence-zz-pp, licence-x...  licence-pp-zz, licence-zz-pp, licence-xx-tt
1        2                                     VV3346E, SS345                                          NaN
2        3                                      licence-dd-zz                                licence-dd-zz
3        4                                 UT223, licence, ss                                          NaN
4        5                XBTYU, licence-tt-kk, licence-ss-tt                 licence-tt-kk, licence-ss-tt
5        6                                             xc, zz                                          NaN
6        7                                      licence-xb-xz                                licence-xb-xz
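
To see why rows without a match end up as NaN, it may help to break the chain into steps. The following is a small sketch on a made-up two-row Series (the names s, matches, wide, joined and result are illustrative only):

```python
import pandas as pd

# Hypothetical two-row sample: one row with matches, one without.
s = pd.Series(["UX226, licence-pp-zz, licence-zz-pp", "VV3346E, SS345"])

# extractall returns one row per match, indexed by (original row, match number);
# rows with no match are simply absent from the result.
matches = s.str.extractall(r"(licence-\w{2}-\w{2})")

# unstack pivots the match number into columns, leaving one row per original row.
wide = matches.unstack()

# Join the non-null matches of each row into a single string.
joined = wide.apply(lambda x: ", ".join(x.dropna()), axis=1)

# Aligning against the original index makes the unmatched rows explicit as NaN,
# which is what happens implicitly when assigning to a DataFrame column.
result = joined.reindex(s.index)
print(result)
```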

2

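Another approach, using Series.str.findall and Series.str.join. Note that this pattern also matches the bare "licence" in row 3, which the desired output leaves empty: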

df['licence'] = df['licences/patents'].str.findall(r'(licence[^,]*)').str.join(', ')

[out]

   company                                   licences/patents  \
0        1  UX226, licence-pp-zz, licence-zz-pp, licence-x...   
1        2                                     VV3346E, SS345   
2        3                                      licence-dd-zz   
3        4                                 UT223, licence, ss   
4        5                XBTYU, licence-tt-kk, licence-ss-tt   
5        6                                             xc, zz   
6        7                                      licence-xb-xz   

                                       licence  
0  licence-pp-zz, licence-zz-pp, licence-xx-tt  
1                                               
2                                licence-dd-zz  
3                                      licence  
4                 licence-tt-kk, licence-ss-tt  
5                                               
6                                licence-xb-xz  
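
If the bare "licence" entry should be excluded, a stricter pattern can be swapped in. A sketch comparing the two on that one row (the Series s and the names loose and strict are made up for illustration):

```python
import pandas as pd

# Hypothetical one-row sample taken from row 3 of the question's data.
s = pd.Series(["UT223, licence, ss"])

# Loose pattern: anything starting with "licence" up to the next comma.
loose = s.str.findall(r"licence[^,]*").str.join(", ")

# Strict pattern: "licence" followed by two hyphen-separated two-character parts.
strict = s.str.findall(r"licence-\w{2}-\w{2}").str.join(", ")

print(repr(loose[0]))   # the bare word is matched
print(repr(strict[0]))  # nothing is matched, so the joined string is empty
```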


1

This regex captures each licence entry in a single group, $1, which may give you the desired output:

(licence-[a-z]{2}-[a-z]{2})
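
In Python, the pattern can be tried on a single string with re.findall, which returns the content of the group for every match. A minimal sketch using the first row of the question's data (the names text, pattern and found are illustrative):

```python
import re

# First row of the question's sample data.
text = "UX226, licence-pp-zz, licence-zz-pp, licence-xx-tt"

# With exactly one capturing group, findall returns a list of that group's matches.
pattern = re.compile(r"(licence-[a-z]{2}-[a-z]{2})")
found = pattern.findall(text)
print(found)
```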


1

import re
df['licence'] = df['licences/patents'].apply(lambda x: ', '.join(re.findall(r'licence-\w{2}-\w{2}', x)))

This creates a new column, licence, containing the licences extracted from the original column.


1

Try below code:

df['licence'] = df['licences/patents'].str.findall(r'(licence-[\w\-]+)').apply(', '.join)
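
One caveat, not raised in the question: if the column can contain missing values, applying ', '.join directly would raise a TypeError on them, because str.findall leaves missing entries as missing rather than returning a list. A sketch of a more tolerant variant using Series.str.join, on a made-up Series s with one missing entry:

```python
import pandas as pd

# Hypothetical sample with a missing value; not part of the question's data.
s = pd.Series(["XBTYU, licence-tt-kk, licence-ss-tt", None])

# str.findall keeps the missing entry as missing, and str.join passes it
# through unchanged instead of raising.
out = s.str.findall(r"licence-[\w-]+").str.join(", ")
print(out)
```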
