Parse dataframe column for substring and return value

Question

I have dataframe with a column called 'INFO' containing a string delimited by a semicolon like below. I would like to parse the string to return values corresponding to specific substrings and assign it to a new column.

EX. In the table below, I would want to assign the value (after the equals sign) for 'CLNDISDB' to a column of the same name.

I have tried:

df['INFO'].str.split(';',expand=True)

which assigned each string part to a new column but that would still require me to parse each column for the necessary string. Any help would be greatly appreciated.

I can only use python standard libraries


|    | INFO                                              |
|----|---------------------------------------------------|
| 0  | AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;... |
| 1  | AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926... |
| 2  | ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:... |
| 3  | ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:... |
| 4  | AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;... |
| 5  | AF_ESP=0.40158;AF_EXAC=0.37025;AF_TGP=0.33886;... |
| 6  | ALLELEID=556509;CLNDISDB=MedGen:C4015293,OMIM:... |
| 7  | ALLELEID=556512;CLNDISDB=MedGen:C4015293,OMIM:... |
| 8  | ALLELEID=171289;CLNDISDB=MedGen:C4015293,OMIM:... |
| 9  | AF_EXAC=0.00001;ALLELEID=171288;CLNDISDB=MedGe... |
| 10 | AF_ESP=0.00038;AF_EXAC=0.00036;AF_TGP=0.00060;... |
| 11 | AF_ESP=0.00987;AF_EXAC=0.00772;AF_TGP=0.01558;... |
| 12 | AF_ESP=0.00070;AF_TGP=0.00080;ALLELEID=446981;... |
| 13 | AF_EXAC=0.00038;ALLELEID=446941;CLNDISDB=MedGe... |
| 14 | AF_EXAC=0.00114;AF_TGP=0.00958;ALLELEID=364282... |
| 15 | ALLELEID=556516;CLNDISDB=MedGen:C3808739,OMIM:... |
| 16 | AF_EXAC=0.00024;ALLELEID=364148;CLNDISDB=MedGe... |
| 17 | ALLELEID=514900;CLNDISDB=MedGen:C3808739,OMIM:... |
| 18 | AF_EXAC=0.42418;AF_TGP=0.28255;ALLELEID=133759... |
| 19 | ALLELEID=364285;CLNDISDB=MedGen:CN169374;CLNDN... |

Paul Brennan · Accepted Answer · 2020-12-14 04:28:42Z

This was fun, so setting up the dataframe

import re
import pandas as pd
import numpy as np

test={'INFO':['AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759', 'foo','AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759']}
df=pd.DataFrame(test)

we get

    INFO
0   AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759
1   foo
2   AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759

Now using a lambda function and a conditional we just extract the values for "AF_TGP". Setting up a new column, we extract the number from the regex pattern that surrounds the value. But we only run the regex when it will pass the text, using the in function in python.

df['a'] = df['INFO'].apply(lambda x: re.search(r'AF_TGP=(?P<number>[+-]?\d+\.\d+);', str(x)).group('number') if 'AF_TGP' in str(x) else np.nan)

this gives the result

    INFO                                            a
0   AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759 -0.28255
1   foo                                             NaN
2   AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759 0.28255

savvysavage32 · Accepted Answer · 2020-12-14 03:50:26Z

0

Figured it out. Works well, have to clean up the results a bit and remove an extra space and comma.

def my_func(x):
    return  [value.split("CLNDISDB=",1) for value in x.split(';') if 'CLNDISDB' in value]

df['CLNDISDB'] = [my_func(x) for x in df['INFO'].values]

answered Dec 14, 2020 at 3:50

savvysavage32

718 bronze badges

Collectives™ on Stack Overflow

Parse dataframe column for substring and return value

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related