You can do this with pandas' explode, groupby, and apply methods.
Let's create a sample dataframe first.
import pandas as pd

df1 = pd.DataFrame.from_dict({
    'term description': ['metabolic process', 'organic substance metabolic process', 'metabolic process a', 'metabolic process b', 'metabolic process c'],
    'false discovery rate': [1.01, 1.001, 1.02, 1.03, 1.04],
    'proteins': ['CYP51A1,CPA1,STK10', 'CYP51A1,CPA1,AAA', 'CPA1,AAA,BBB,CCC', 'AAA,BBB,CCC,DDD', 'AAA,CCC,EEE,FFF'],
})
# dataframe df1
                      term description  false discovery rate            proteins
0                    metabolic process                 1.010  CYP51A1,CPA1,STK10
1  organic substance metabolic process                 1.001    CYP51A1,CPA1,AAA
2                  metabolic process a                 1.020    CPA1,AAA,BBB,CCC
3                  metabolic process b                 1.030     AAA,BBB,CCC,DDD
4                  metabolic process c                 1.040     AAA,CCC,EEE,FFF
Let's split each value in the proteins column into a list, so that we can explode it into one row per protein.
df1['proteins'] = df1['proteins'].str.split(',')
df1 = df1.explode('proteins')
# dataframe df1
                      term description  false discovery rate  proteins
0                    metabolic process                 1.010   CYP51A1
0                    metabolic process                 1.010      CPA1
0                    metabolic process                 1.010     STK10
1  organic substance metabolic process                 1.001   CYP51A1
1  organic substance metabolic process                 1.001      CPA1
1  organic substance metabolic process                 1.001       AAA
2                  metabolic process a                 1.020      CPA1
2                  metabolic process a                 1.020       AAA
2                  metabolic process a                 1.020       BBB
2                  metabolic process a                 1.020       CCC
3                  metabolic process b                 1.030       AAA
3                  metabolic process b                 1.030       BBB
3                  metabolic process b                 1.030       CCC
3                  metabolic process b                 1.030       DDD
4                  metabolic process c                 1.040       AAA
4                  metabolic process c                 1.040       CCC
4                  metabolic process c                 1.040       EEE
4                  metabolic process c                 1.040       FFF
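If your real data has stray spaces around the commas (an assumption on my part; the sample above has none), the split-and-explode step can be sketched as follows. The `raw` frame and its values are hypothetical, and `ignore_index=True` is optional: it just gives the exploded rows a fresh 0..n-1 index instead of repeating the original one.

```python
import pandas as pd

# Hypothetical sample with inconsistent spacing around the commas.
raw = pd.DataFrame({
    'term description': ['metabolic process', 'metabolic process a'],
    'proteins': ['CYP51A1, CPA1', 'CPA1 ,AAA'],
})

exploded = (
    raw.assign(proteins=raw['proteins'].str.split(','))  # string -> list of names
       .explode('proteins', ignore_index=True)           # one row per protein
)
# Trim the whitespace left over from the split.
exploded['proteins'] = exploded['proteins'].str.strip()

print(exploded)
```

Without the strip, 'CPA1' and ' CPA1' would end up as two different groups in the groupby step.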
Now we'll combine the 'term description' values that belong to the same protein into one list per protein.
df2 = df1.groupby('proteins')['term description'].apply(list).reset_index()
# dataframe df2
   proteins                                   term description
0       AAA  [organic substance metabolic process, metaboli...
1       BBB         [metabolic process a, metabolic process b]
2       CCC  [metabolic process a, metabolic process b, met...
3      CPA1  [metabolic process, organic substance metaboli...
4   CYP51A1  [metabolic process, organic substance metaboli...
5       DDD                              [metabolic process b]
6       EEE                              [metabolic process c]
7       FFF                              [metabolic process c]
8     STK10                                [metabolic process]
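The printout truncates long lists with "...", so it can be handy to look up the full list for a single protein. A minimal sketch, using a small hypothetical `exploded` frame in place of the real one: converting the grouped Series to a dict gives a plain protein-to-terms lookup.

```python
import pandas as pd

# Hypothetical, already-exploded data: one row per (protein, term) pair.
exploded = pd.DataFrame({
    'proteins': ['CYP51A1', 'CPA1', 'CYP51A1', 'AAA'],
    'term description': [
        'metabolic process',
        'metabolic process',
        'organic substance metabolic process',
        'organic substance metabolic process',
    ],
})

# Same groupby as above, but kept as a Series indexed by protein.
grouped = exploded.groupby('proteins')['term description'].apply(list)

# protein name -> full list of term descriptions, in original row order.
lookup = grouped.to_dict()
print(lookup['CYP51A1'])
```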
Now all we need to do is apply a function that modifies the 'proteins' values as per your requirements. I'm adding a sample one based on what you mentioned; you can add more conditions inside it as you need. Proteins that match no condition are returned unchanged, otherwise the function would implicitly return None for them.
def modifier(protein, term_descrip):
    if protein == 'CYP51A1' and set(term_descrip).intersection({'metabolic process', 'organic substance metabolic process'}):
        return 'CYP51A1 etc.'
    # add more if conditions as required
    return protein  # no condition matched: keep the original name
df2['proteins'] = df2.apply(lambda row: modifier(row['proteins'], row['term description']), axis=1)
# dataframe df2
       proteins                                   term description
0           AAA  [organic substance metabolic process, metaboli...
1           BBB         [metabolic process a, metabolic process b]
2           CCC  [metabolic process a, metabolic process b, met...
3          CPA1  [metabolic process, organic substance metaboli...
4  CYP51A1 etc.  [metabolic process, organic substance metaboli...
5           DDD                              [metabolic process b]
6           EEE                              [metabolic process c]
7           FFF                              [metabolic process c]
8         STK10                                [metabolic process]
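If you end up with many conditions, one option is to keep them in a dict instead of a chain of ifs. This is only a sketch under my own assumptions: the `RULES` table, its layout, and the tiny `df2` below are hypothetical, and any protein without a matching rule keeps its original name.

```python
import pandas as pd

# Hypothetical rule table: protein -> (terms that trigger the rename, new label).
RULES = {
    'CYP51A1': (
        {'metabolic process', 'organic substance metabolic process'},
        'CYP51A1 etc.',
    ),
}

def modifier(protein, term_descrip):
    """Return the new label if a rule matches, else the protein unchanged."""
    rule = RULES.get(protein)
    if rule is not None:
        trigger_terms, new_label = rule
        # Rename only if the protein has at least one of the trigger terms.
        if trigger_terms & set(term_descrip):
            return new_label
    return protein

# Small stand-in for the grouped dataframe.
df2 = pd.DataFrame({
    'proteins': ['CYP51A1', 'STK10'],
    'term description': [
        ['metabolic process', 'organic substance metabolic process'],
        ['metabolic process'],
    ],
})

df2['proteins'] = [
    modifier(p, t) for p, t in zip(df2['proteins'], df2['term description'])
]
print(df2)
```

Adding a new condition then means adding one entry to `RULES` rather than editing the function.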