1

I have dataframe with a column called 'INFO' containing a string delimited by a semicolon like below. I would like to parse the string to return values corresponding to specific substrings and assign it to a new column.

EX. In the table below, I would want to assign the value (after the equals sign) for 'CLNDISDB' to a column of the same name.

I have tried:

df['INFO'].str.split(';',expand=True)

which assigned each string part to a new column but that would still require me to parse each column for the necessary string. Any help would be greatly appreciated.

I can only use python standard libraries


|    | INFO                                              |
|----|---------------------------------------------------|
| 0  | AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;... |
| 1  | AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926... |
| 2  | ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:... |
| 3  | ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:... |
| 4  | AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;... |
| 5  | AF_ESP=0.40158;AF_EXAC=0.37025;AF_TGP=0.33886;... |
| 6  | ALLELEID=556509;CLNDISDB=MedGen:C4015293,OMIM:... |
| 7  | ALLELEID=556512;CLNDISDB=MedGen:C4015293,OMIM:... |
| 8  | ALLELEID=171289;CLNDISDB=MedGen:C4015293,OMIM:... |
| 9  | AF_EXAC=0.00001;ALLELEID=171288;CLNDISDB=MedGe... |
| 10 | AF_ESP=0.00038;AF_EXAC=0.00036;AF_TGP=0.00060;... |
| 11 | AF_ESP=0.00987;AF_EXAC=0.00772;AF_TGP=0.01558;... |
| 12 | AF_ESP=0.00070;AF_TGP=0.00080;ALLELEID=446981;... |
| 13 | AF_EXAC=0.00038;ALLELEID=446941;CLNDISDB=MedGe... |
| 14 | AF_EXAC=0.00114;AF_TGP=0.00958;ALLELEID=364282... |
| 15 | ALLELEID=556516;CLNDISDB=MedGen:C3808739,OMIM:... |
| 16 | AF_EXAC=0.00024;ALLELEID=364148;CLNDISDB=MedGe... |
| 17 | ALLELEID=514900;CLNDISDB=MedGen:C3808739,OMIM:... |
| 18 | AF_EXAC=0.42418;AF_TGP=0.28255;ALLELEID=133759... |
| 19 | ALLELEID=364285;CLNDISDB=MedGen:CN169374;CLNDN... |

2 Answers 2

1

This was fun, so setting up the dataframe

import re
import pandas as pd
import numpy as np

test={'INFO':['AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759', 'foo','AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759']}
df=pd.DataFrame(test)

we get

    INFO
0   AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759
1   foo
2   AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759

Now using a lambda function and a conditional we just extract the values for "AF_TGP". Setting up a new column, we extract the number from the regex pattern that surrounds the value. But we only run the regex when it will pass the text, using the in function in python.

df['a'] = df['INFO'].apply(lambda x: re.search(r'AF_TGP=(?P<number>[+-]?\d+\.\d+);', str(x)).group('number') if 'AF_TGP' in str(x) else np.nan)

this gives the result

    INFO                                            a
0   AF_EXAC=0.42418;AF_TGP=-0.28255;ALLELEID=133759 -0.28255
1   foo                                             NaN
2   AF_EXAC=-0.42418;AF_TGP=0.28255;ALLELEID=133759 0.28255
Sign up to request clarification or add additional context in comments.

Comments

0

Figured it out. Works well, have to clean up the results a bit and remove an extra space and comma.

def my_func(x):
    return  [value.split("CLNDISDB=",1) for value in x.split(';') if 'CLNDISDB' in value]

df['CLNDISDB'] = [my_func(x) for x in df['INFO'].values]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.