I have this pandas DataFrame, df, in Python:

    uniprot_id(PK)                                       protein_name  ... protein_family protein_subfamily
0           Q8TAS1              Serine/threonine-protein kinase Kist   ...            KIS               NaN
1           P35916     Vascular endothelial growth factor receptor 3   ...          VEGFR               NaN
2           Q96SB4                             SRSF protein kinase 1   ...           SRPK               NaN
3           Q6P3W7                               SCY1-like protein 2   ...           SCY1               NaN
4           Q9UKI8    Serine/threonine-protein kinase tousled-like 1   ...            TLK               NaN
5           P30291                          Wee1-like protein kinase   ...            WEE               NaN
6           Q15120                            Pyruvate dehydrogenase   ...           PDHK               NaN
7           Q7L7X3              Serine/threonine-protein kinase TAO1   ...          STE20               TAO
8           O75385              Serine/threonine-protein kinase ULK1   ...            ULK               NaN
9           P08922        Proto-oncogene tyrosine-protein kinase ROS   ...            Sev               NaN
10          Q9P289                Serine/threonine-protein kinase 26   ...          STE20               YSK
11          Q9NRP7                Serine/threonine-protein kinase 36   ...            ULK               NaN
12          Q9C0K7         STE20-related kinase adapter protein beta   ...          STE20              STLK
13          Q8IZX4  Transcription initiation factor TFIID subunit ...  ...           TAF1               NaN
14          Q9UKE5          TRAF2 and NCK-interacting protein kinase   ...          STE20               MSN
15          Q5TCY1                              Tau-tubulin kinase 1   ...           TTBK               NaN
16          P33981               Dual specificity protein kinase TTK   ...            TTK               NaN
17          P07949  Proto-oncogene tyrosine-protein kinase recepto...  ...            Ret               NaN
18          O14730              Serine/threonine-protein kinase RIO3   ...            RIO              RIO3
19          O43353  Receptor-interacting serine/threonine-protein ...  ...           RIPK               NaN
20          P57078  Receptor-interacting serine/threonine-protein ...  ...           RIPK               NaN
21          Q9Y2H1           Serine/threonine-protein kinase 38-like   ...            NDR               NaN
22          Q9UEW8  STE20/SPS1-related proline-alanine-rich protei...  ...          STE20              FRAY
23          Q8TDR2                Serine/threonine-protein kinase 35   ...           NKF4               NaN
24          P49842                Serine/threonine-protein kinase 19   ...            G11               NaN
25          Q13177             Serine/threonine-protein kinase PAK 2   ...          STE20              PAKA
26          B5MCJ9            Tripartite motif-containing protein 66   ...           TIF1               NaN
27          Q6IBK5  Transcription initiation factor IIF subunit alpha  ...         GTF2F1               NaN
28          Q8N165            Serine/threonine-protein kinase PDIK1L   ...           NKF4               NaN
29          Q86YV6         Myosin light chain kinase family member 4   ...           MLCK               NaN
30          Q8TCG2         Phosphatidylinositol 4-kinase type 2-beta   ...            NaN               NaN
31          Q16654                            Pyruvate dehydrogenase   ...           PDHK               NaN
32          P51817  cAMP-dependent protein kinase catalytic subuni...  ...            PKA               NaN
33      A0A0B4J2F2    Putative serine/threonine-protein kinase SIK1B   ...            NaN               NaN
34          P57059              Serine/threonine-protein kinase SIK1   ...          CAMKL               QIK
35          Q9H0K1              Serine/threonine-protein kinase SIK2   ...          CAMKL               QIK
36          Q9Y2K2              Serine/threonine-protein kinase SIK3   ...          CAMKL               QIK
37          Q9BXU1                Serine/threonine-protein kinase 31   ...   Other-Unique               NaN
38          Q13263          Transcription intermediary factor 1-beta   ...           TIF1               NaN
39          Q32MK0                       Myosin light chain kinase 3   ...           MLCK               NaN
40          Q13153             Serine/threonine-protein kinase PAK 1   ...          STE20              PAKA
41          Q16816  Phosphorylase b kinase gamma catalytic chain; ...  ...            PHK               NaN
42          Q05823                       2-5A-dependent ribonuclease   ...   Other-Unique               NaN
43          Q8IWB6    Inactive serine/threonine-protein kinase TEX14   ...           NKF5               NaN
44          Q8IWB6    Inactive serine/threonine-protein kinase TEX14   ...           NKF5               NaN
45          Q9BX84  Transient receptor potential cation channel su...  ...          Alpha              ChaK
46          Q9H1R3  Myosin light chain kinase 2; skeletal/cardiac ...  ...           MLCK               NaN
47          O75116                   Rho-associated protein kinase 2   ...           DMPK              ROCK
48          Q01973  Inactive tyrosine-protein kinase transmembrane...  ...            Ror               NaN
49          O75962                  Triple functional domain protein   ...           Trio               NaN
50          Q9Y4A5  Transformation/transcription domain-associated...  ...           PIKK             TRRAP
51          Q8NEB9  Phosphatidylinositol 3-kinase catalytic subuni...  ...            NaN               NaN
52          Q496M5     Inactive serine/threonine-protein kinase PLK5   ...            NaN               NaN
53          O00444              Serine/threonine-protein kinase PLK4   ...            PLK               NaN
54          Q06418            Tyrosine-protein kinase receptor TYRO3   ...            Axl               NaN
55          Q9Y572  Receptor-interacting serine/threonine-protein ...  ...           RIPK               NaN
56          Q6IQ55                              Tau-tubulin kinase 2   ...           TTBK               NaN
57          Q6PHR2              Serine/threonine-protein kinase ULK3   ...            ULK               NaN
58          P30530              Tyrosine-protein kinase receptor UFO   ...            Axl               NaN
59          Q9Y6S9                Ribosomal protein S6 kinase-like 1   ...           RSKL               NaN
60          Q01974  Tyrosine-protein kinase transmembrane receptor...  ...            Ror               NaN
61          Q15772  Striated muscle preferentially expressed prote...  ...           Trio               NaN
62          Q15772  Striated muscle preferentially expressed prote...  ...           Trio               NaN
63          Q9UHD2              Serine/threonine-protein kinase TBK1   ...            IKK               NaN
64          Q8TEA7  TBC domain-containing protein kinase-like protein  ...           TBCK               NaN
65          Q96PF2  Testis-specific serine/threonine-protein kinas...  ...           TSSK               NaN
66          Q9H792            Inactive tyrosine-protein kinase PEAK1   ...           NKF3               NaN
67          O43930     Putative serine/threonine-protein kinase PRKY   ...            PKA               NaN
68          P0C1S8                        Wee1-like protein kinase 2   ...            WEE               NaN
69          Q96KB5  Lymphokine-activated killer T-cell-originated ...  ...           TOPK               NaN
70          Q9BXA6  Testis-specific serine/threonine-protein kinas...  ...           TSSK               NaN
71          Q96C45              Serine/threonine-protein kinase ULK4   ...            ULK               NaN
72          P29597         Non-receptor tyrosine-protein kinase TYK2   ...            Jak               NaN
73          P29597         Non-receptor tyrosine-protein kinase TYK2   ...           JakB               NaN
74          Q8WZ42                                             Titin   ...           MLCK               NaN
75          Q86UE8    Serine/threonine-protein kinase tousled-like 2   ...            TLK               NaN
76          Q9BXA7  Testis-specific serine/threonine-protein kinas...  ...           TSSK               NaN
77          Q96KG9                    N-terminal kinase-like protein   ...           SCY1               NaN
78          Q9NRH2       SNF-related serine/threonine-protein kinase   ...          CAMKL              SNRK
79          O94768               Serine/threonine-protein kinase 17B   ...           DAPK               NaN
80          O75716                Serine/threonine-protein kinase 16   ...            NAK               NaN
81          Q15831             Serine/threonine-protein kinase STK11   ...          CAMKL               LKB
82          P07947                       Tyrosine-protein kinase Yes   ...            Src               NaN
83          Q8IV63     Inactive serine/threonine-protein kinase VRK3   ...            VRK               NaN
84          P35968     Vascular endothelial growth factor receptor 2   ...          VEGFR               NaN
85          Q99986              Serine/threonine-protein kinase VRK1   ...            VRK               NaN
86          Q9BYP7              Serine/threonine-protein kinase WNK3   ...            WNK               NaN
87          Q96BR1              Serine/threonine-protein kinase Sgk3   ...            SGK               NaN
88          Q9H2G2        STE20-like serine/threonine-protein kinase   ...          STE20               SLK
89          O94804                Serine/threonine-protein kinase 10   ...          STE20               SLK
90          Q9UPN9                E3 ubiquitin-protein ligase TRIM33   ...           TIF1               NaN
91          Q92519                                Tribbles homolog 2   ...           Trbl               NaN
92          Q9UL54              Serine/threonine-protein kinase TAO2   ...          STE20               TAO
93          Q96RU8                                Tribbles homolog 1   ...           Trbl               NaN
94          Q96PN8  Testis-specific serine/threonine-protein kinas...  ...           TSSK               NaN
95          Q9H4A3              Serine/threonine-protein kinase WNK1   ...            WNK               NaN
96          Q6SA08  Testis-specific serine/threonine-protein kinas...  ...           TSSK               NaN
97          P43403                    Tyrosine-protein kinase ZAP-70   ...            Syk               NaN
98          P42681                       Tyrosine-protein kinase TXK   ...            Tec               NaN
99          P17948     Vascular endothelial growth factor receptor 1   ...          VEGFR               NaN
100         P21675   Transcription initiation factor TFIID subunit 1   ...           TAF1               NaN
101         Q02763                           Angiopoietin-1 receptor   ...            Tie               NaN
102         Q96J92              Serine/threonine-protein kinase WNK4   ...            WNK               NaN
103         Q13470         Non-receptor tyrosine-protein kinase TNK1   ...            Ack               NaN
104         Q9Y3S1              Serine/threonine-protein kinase WNK2   ...            WNK               NaN
105         Q86Y07              Serine/threonine-protein kinase VRK2   ...            VRK               NaN
106         Q96RU7                                Tribbles homolog 3   ...           Trbl               NaN
107         Q9NRL2  Bromodomain adjacent to zinc finger domain pro...  ...            BAZ               NaN
108         Q9NSY1                    BMP-2-inducible protein kinase   ...            NAK               NaN
109         Q13131  5-AMP-activated protein kinase catalytic subun...  ...          CAMKL              AMPK
110         Q96QP1                            Alpha-protein kinase 1   ...          Alpha               NaN
111         Q00532                    Cyclin-dependent kinase-like 1   ...           CDKL               NaN
112         P07333   Macrophage colony-stimulating factor 1 receptor   ...          PDGFR               NaN
113         Q13705                          Activin receptor type-2B   ...           STKR             STKR2
114         Q9UIG0                     Tyrosine-protein kinase BAZ1B   ...            BAZ               NaN
115         Q8IWQ3             Serine/threonine-protein kinase BRSK2   ...          CAMKL              BRSK
116         P51813           Cytoplasmic tyrosine-protein kinase BMX   ...            Tec               NaN
117         Q08345  Epithelial discoidin domain-containing recepto...  ...            DDR               NaN
118         Q16832            Discoidin domain-containing receptor 2   ...            DDR               NaN
119         Q8N568             Serine/threonine-protein kinase DCLK2   ...         DCAMKL               NaN
120         O76039                    Cyclin-dependent kinase-like 5   ...           CDKL               NaN
121         P00533                  Epidermal growth factor receptor   ...           EGFR               NaN
122         Q13873        Bone morphogenetic protein receptor type-2   ...           STKR             STKR2
123         P50613                         Cyclin-dependent kinase 7   ...            CDK              CDK7
124         Q9UQB9                                   Aurora kinase C   ...            Aur               NaN
125         P25440                  Bromodomain-containing protein 2   ...            BRD               NaN
126         P51451                       Tyrosine-protein kinase Blk   ...            Src               NaN
127         P29323                          Ephrin type-B receptor 2   ...            Eph               NaN
128         P54764                          Ephrin type-A receptor 4   ...            Eph               NaN
129         Q05397                           Focal adhesion kinase 1   ...            FAK               NaN
130         P11801                Serine/threonine-protein kinase H1   ...            PSK               NaN
131         P23443                Ribosomal protein S6 kinase beta-1   ...            RSK            RSKp70
132         Q96LW2       Ribosomal protein S6 kinase-related protein   ...           RSKR               NaN
133         Q9UK32               Ribosomal protein S6 kinase alpha-6   ...            RSK            RSKp90
134         Q9UK32               Ribosomal protein S6 kinase alpha-6   ...           RSKb              RSKb
135         Q8NB16          Mixed lineage kinase domain-like protein   ...     TKL-Unique               NaN
136         O00750  Phosphatidylinositol 4-phosphate 3-kinase C2 d...  ...            NaN               NaN
137         O60566  Mitotic checkpoint serine/threonine-protein ki...  ...            BUB               NaN
138         Q9UPZ9               Serine/threonine-protein kinase ICK   ...            RCK               NaN
139         O14965                                   Aurora kinase A   ...            Aur               NaN
140         O60885                  Bromodomain-containing protein 4   ...            BRD               NaN
141         Q58F21               Bromodomain testis-specific protein   ...            BRD               NaN
142         Q15131                        Cyclin-dependent kinase 10   ...            CDK             CDK10
143         Q00537                        Cyclin-dependent kinase 17   ...            CDK           PCTAIRE
144         Q8NI60              Atypical kinase COQ8A; mitochondrial   ...           ABC1            ABC1-A
145         Q15303           Receptor tyrosine-protein kinase erbB-4   ...           EGFR               NaN
146         P08069             Insulin-like growth factor 1 receptor   ...           InsR               NaN
147         O15111  Inhibitor of nuclear factor kappa-B kinase sub...  ...            IKK               NaN
148         O14920  Inhibitor of nuclear factor kappa-B kinase sub...  ...            IKK               NaN
149         O43187   Interleukin-1 receptor-associated kinase-like 2   ...           IRAK               NaN
150         Q9Y243         RAC-gamma serine/threonine-protein kinase   ...            Akt               NaN
151         Q04771                           Activin receptor type-1   ...           STKR             STKR1
152         Q7Z695  Uncharacterized aarF domain-containing protein...  ...           ABC1            ABC1-C
153         P16066             Atrial natriuretic peptide receptor 1   ...            RGC               NaN
154         Q8NFD2  Ankyrin repeat and protein kinase domain-conta...  ...           RIPK               NaN
155         Q13535               Serine/threonine-protein kinase ATR   ...           PIKK               ATR
156         P36894       Bone morphogenetic protein receptor type-1A   ...           STKR             STKR1
157         P11274                 Breakpoint cluster region protein   ...            BCR               NaN
158         Q09013                           Myotonin-protein kinase   ...           DMPK               GEK
159         Q13315                         Serine-protein kinase ATM   ...           PIKK               ATM
160         P53004                            Biliverdin reductase A   ...          BLVRA               NaN
161         O43683  Mitotic checkpoint serine/threonine-protein ki...  ...            BUB               NaN
162         P10398             Serine/threonine-protein kinase A-Raf   ...            RAF               NaN
163         P20594             Atrial natriuretic peptide receptor 2   ...            RGC               NaN
164         P35626                 Beta-adrenergic receptor kinase 2   ...            GRK              BARK
165         P49761              Dual specificity protein kinase CLK3   ...            CLK               NaN
166         P24941                         Cyclin-dependent kinase 2   ...            CDK              CDK2
167         P50750                         Cyclin-dependent kinase 9   ...            CDK              CDK9
168         Q07002                        Cyclin-dependent kinase 18   ...            CDK           PCTAIRE
169         P29320                          Ephrin type-A receptor 3   ...            Eph               NaN
170         P54762                          Ephrin type-B receptor 1   ...            Eph               NaN
171         P22455               Fibroblast growth factor receptor 4   ...           FGFR               NaN
172         P31751          RAC-beta serine/threonine-protein kinase   ...            Akt               NaN
173         Q15059                  Bromodomain-containing protein 3   ...            BRD               NaN
174         P00519                      Tyrosine-protein kinase ABL1   ...            Abl               NaN
175         O00238       Bone morphogenetic protein receptor type-1B   ...           STKR             STKR1
176         P31749         RAC-alpha serine/threonine-protein kinase   ...            Akt               NaN
177         Q8NER5                          Activin receptor type-1C   ...           STKR             STKR1
178         P27037                          Activin receptor type-2A   ...           STKR             STKR2
179         P68400                    Casein kinase II subunit alpha   ...            CK2               NaN
180         P15056             Serine/threonine-protein kinase B-raf   ...            RAF               NaN
181         Q06187                       Tyrosine-protein kinase BTK   ...            Tec               NaN
182         Q9C098             Serine/threonine-protein kinase DCLK3   ...         DCAMKL               NaN
183         Q00526                         Cyclin-dependent kinase 3   ...            CDK              CDK2
184         P19784                    Casein kinase II subunit alpha   ...            CK2               NaN
185         Q8NEV1                  Casein kinase II subunit alpha 3   ...            NaN               NaN

Some rows are duplicates (they share the same uniprot_id(PK)), as shown below:

133         Q9UK32               Ribosomal protein S6 kinase alpha-6   ...            RSK            RSKp90
134         Q9UK32               Ribosomal protein S6 kinase alpha-6   ...           RSKb              RSKb

I was wondering what the best way would be to combine the columns of these rows, separating the values with a semicolon if they differ (if they are the same, I just want a single value). Ideally I would also like to be able to specify the order. The desired result:

133         Q9UK32               Ribosomal protein S6 kinase alpha-6   ...            RSK; RSKb            RSKp90; RSKb
  • When you combine, do you want columns where the entries are different to be put into a list? How is that data handled, or is it identical all the way through? Commented Jan 11, 2022 at 23:34
  • No array; I would just like to replace my original dataframe with a new one in which differing column values are combined for the same uniprot_id(PK) entry. Commented Jan 11, 2022 at 23:40

1 Answer

Something like this:

df.groupby(['uniprot_id(PK)', 'protein_name'])[['protein_family', 'protein_subfamily']].agg(lambda x: '; '.join(x.dropna().unique()))

You probably want to sort your df before the groupby, since rows keep their relative order within each group:

df.sort_values(['protein_family','protein_subfamily']).groupby(...)
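
As a rough sketch of why pre-sorting works: groupby keeps the rows of each group in the order they had in the input frame, so the order going in is the order the values are joined in. The toy frame below is an assumption for illustration (rows 133 and 134 from the question plus one unrelated row, with the `...` columns left out), and it sorts by index in descending order so that row 134 comes before row 133.

```python
import pandas as pd

# Toy data: rows 133 and 134 from the question, plus one unrelated row.
df = pd.DataFrame(
    {
        "uniprot_id(PK)": ["Q9UK32", "Q9UK32", "P24941"],
        "protein_name": [
            "Ribosomal protein S6 kinase alpha-6",
            "Ribosomal protein S6 kinase alpha-6",
            "Cyclin-dependent kinase 2",
        ],
        "protein_family": ["RSK", "RSKb", "CDK"],
        "protein_subfamily": ["RSKp90", "RSKb", "CDK2"],
    },
    index=[133, 134, 166],
)

keys = ["uniprot_id(PK)", "protein_name"]
cols = ["protein_family", "protein_subfamily"]

# Reverse the row order first; groupby preserves row order inside each group,
# so the values of row 134 are joined before those of row 133.
reordered = (
    df.sort_index(ascending=False)
      .groupby(keys)[cols]
      .agg(lambda s: "; ".join(s))
      .reset_index()
)
print(reordered)
```

Any other sort (by a column, or by a helper column holding an explicit rank) can be used in place of sort_index to control the join order in the same way.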

If this is not what you want, you may want to sort within each group instead:

df.groupby(['uniprot_id(PK)', 'protein_name'])[['protein_family', 'protein_subfamily']].agg(lambda x: '; '.join(sorted(x.dropna().unique())))
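
A fuller sketch of the whole approach, assuming the key column is literally named uniprot_id(PK) as in the printed frame. The helper name join_unique and the reduced toy frame (a few rows taken from the question, without the `...` columns) are illustrative and not part of the original post. The helper drops NaN, keeps each distinct value once in first-seen order, and returns NaN when a group has no values at all.

```python
import numpy as np
import pandas as pd

# Toy data: a few rows from the question, including NaN cells.
df = pd.DataFrame(
    {
        "uniprot_id(PK)": ["Q9UK32", "Q9UK32", "Q8IWB6", "Q8IWB6", "Q8TCG2"],
        "protein_name": [
            "Ribosomal protein S6 kinase alpha-6",
            "Ribosomal protein S6 kinase alpha-6",
            "Inactive serine/threonine-protein kinase TEX14",
            "Inactive serine/threonine-protein kinase TEX14",
            "Phosphatidylinositol 4-kinase type 2-beta",
        ],
        "protein_family": ["RSK", "RSKb", "NKF5", "NKF5", np.nan],
        "protein_subfamily": ["RSKp90", "RSKb", np.nan, np.nan, np.nan],
    }
)


def join_unique(values, sep="; "):
    """Join the distinct non-null values of one group, in first-seen order."""
    distinct = values.dropna().unique()
    # An all-NaN group stays NaN instead of becoming an empty string.
    return sep.join(distinct) if len(distinct) else np.nan


combined = (
    df.groupby(["uniprot_id(PK)", "protein_name"], sort=False)[
        ["protein_family", "protein_subfamily"]
    ]
    .agg(join_unique)
    .reset_index()
)
print(combined)
```

With this toy input, Q9UK32 collapses to a single row with "RSK; RSKb" and "RSKp90; RSKb", while the exact duplicate Q8IWB6 keeps a single "NKF5". Passing sort=False keeps the groups in their order of first appearance rather than sorting them by key.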
2 Comments

Thank you, one more question. If I use this groupby method, is there any way to specify the order in which it aggregates? For instance, in my example above, what if I wanted the entries ordered 134; 133 instead of 133; 134?
@InanKhan see updated answer
