I have a pandas dataframe that I want to manipulate. Here's an example of the data:

[image: sample of the dataframe]

As you can see, we have 3 columns. The proteins column holds multiple comma-separated elements per row, whereas the term description column holds a single element per row. My aim is to reverse this: one column with a single protein per row and another column with all the term descriptions that protein belongs to. For example, if the protein CYP51A1 falls under the term descriptions metabolic process and organic substance metabolic process, I want my dataframe to look like this:

protein_name | term description
---------------------------------------------------------------------
CYP51A1      | metabolic process, organic substance metabolic process
etc.

I hope I explained this well enough! Thanks for your help!

1 Answer

You can achieve it via pandas explode and apply methods.

Let's create a sample dataframe first.

import pandas as pd

df1 = pd.DataFrame.from_dict({'term description': ['metabolic process', 'organic substance metabolic process', 'metabolic process a', 'metabolic process b', 'metabolic process c'],
                              'false discovery rate': [1.01, 1.001, 1.02, 1.03, 1.04],
                              'proteins': ['CYP51A1,CPA1,STK10', 'CYP51A1,CPA1,AAA', 'CPA1,AAA,BBB,CCC', 'AAA,BBB,CCC,DDD', 'AAA,CCC,EEE,FFF']
                             })

# dataframe df1
    term description                    false discovery rate    proteins
0   metabolic process                   1.010                   CYP51A1,CPA1,STK10
1   organic substance metabolic process 1.001                   CYP51A1,CPA1,AAA
2   metabolic process a                 1.020                   CPA1,AAA,BBB,CCC
3   metabolic process b                 1.030                   AAA,BBB,CCC,DDD
4   metabolic process c                 1.040                   AAA,CCC,EEE,FFF

Let's split each value in the proteins column into a list, so that we can explode it into one row per protein.

df1['proteins'] = df1['proteins'].apply(lambda x: x.split(','))
df1 = df1.explode('proteins')

# dataframe df1         
    term description                    false discovery rate    proteins
0   metabolic process                   1.010                   CYP51A1
0   metabolic process                   1.010                   CPA1
0   metabolic process                   1.010                   STK10
1   organic substance metabolic process 1.001                   CYP51A1
1   organic substance metabolic process 1.001                   CPA1
1   organic substance metabolic process 1.001                   AAA
2   metabolic process a                 1.020                   CPA1
2   metabolic process a                 1.020                   AAA
2   metabolic process a                 1.020                   BBB
2   metabolic process a                 1.020                   CCC
3   metabolic process b                 1.030                   AAA
3   metabolic process b                 1.030                   BBB
3   metabolic process b                 1.030                   CCC
3   metabolic process b                 1.030                   DDD
4   metabolic process c                 1.040                   AAA
4   metabolic process c                 1.040                   CCC
4   metabolic process c                 1.040                   EEE
4   metabolic process c                 1.040                   FFF
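
If the real data has spaces around the commas (an assumption, since the sample above has none), the exploded protein names would carry stray whitespace and fail to group together later. A minimal sketch of one way to guard against that, using a small made-up frame named `raw` rather than the actual data:

```python
import pandas as pd

# Hypothetical sample with inconsistent spacing around the commas
raw = pd.DataFrame({
    'term description': ['metabolic process', 'organic substance metabolic process'],
    'proteins': ['CYP51A1, CPA1', 'CYP51A1 ,AAA'],
})

# Split on commas, create one row per protein, then reset the duplicated index
exploded = (
    raw.assign(proteins=raw['proteins'].str.split(','))
       .explode('proteins')
       .reset_index(drop=True)
)

# Strip leftover whitespace so 'CYP51A1' and ' CPA1' style variants are normalized
exploded['proteins'] = exploded['proteins'].str.strip()
print(exploded)
```

Here `str.split(',')` does the same job as the `apply` with a lambda shown above; either form should work for plain string columns.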

Now we'll combine the values under 'term description' that belong to the same protein.

df2 = df1.groupby('proteins')['term description'].apply(list).reset_index()

# dataframe df2
    proteins    term description
0   AAA         [organic substance metabolic process, metaboli...
1   BBB         [metabolic process a, metabolic process b]
2   CCC         [metabolic process a, metabolic process b, met...
3   CPA1        [metabolic process, organic substance metaboli...
4   CYP51A1     [metabolic process, organic substance metaboli...
5   DDD         [metabolic process b]
6   EEE         [metabolic process c]
7   FFF         [metabolic process c]
8   STK10       [metabolic process]
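
The question asks for the term descriptions as a single comma-separated string under a `protein_name` column, not as a list. A minimal sketch of that variant, assuming an already exploded frame (here a small made-up one named `long_df`) and joining with `', '`:

```python
import pandas as pd

# Hypothetical exploded data: one row per (protein, term description) pair
long_df = pd.DataFrame({
    'proteins': ['CYP51A1', 'CPA1', 'CYP51A1', 'CPA1'],
    'term description': ['metabolic process', 'metabolic process',
                         'organic substance metabolic process',
                         'organic substance metabolic process'],
})

# Join every term description of a protein into one comma-separated string
result = (
    long_df.groupby('proteins')['term description']
           .agg(', '.join)
           .reset_index()
           .rename(columns={'proteins': 'protein_name'})
)
print(result)
```

Whether a list or a joined string is the better fit depends on what happens next: lists are easier to process further, while strings match the layout shown in the question.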

Now, all we need to do is apply a function that modifies the 'proteins' column values as required. I'm adding a sample one based on what you mentioned. You can add more conditions inside this function as you need.

def modifier(protein, term_descrip):
    # relabel CYP51A1 when it has at least one of the two listed terms
    if protein == 'CYP51A1' and set(term_descrip).intersection({'metabolic process', 'organic substance metabolic process'}):
        return 'CYP51A1 etc.'
    # add more if conditions as required
    return protein  # leave every other protein unchanged instead of returning None

df2['proteins'] = df2.apply(lambda row: modifier(row['proteins'], row['term description']), axis=1)

# dataframe df2
    proteins        term description
0   AAA             [organic substance metabolic process, metaboli...
1   BBB             [metabolic process a, metabolic process b]
2   CCC             [metabolic process a, metabolic process b, met...
3   CPA1            [metabolic process, organic substance metaboli...
4   CYP51A1 etc.    [metabolic process, organic substance metaboli...
5   DDD             [metabolic process b]
6   EEE             [metabolic process c]
7   FFF             [metabolic process c]
8   STK10           [metabolic process]