merging multiple rows into one row

Question

I have a file:

Gene stable GO_ID
AAEL025769 AAEL025769-RA GO:0005525
AAEL020629 AAEL020629-RA GO:0003677
AAEL020629 AAEL020629-RA GO:0005634
AAEL020629 AAEL020629-RA GO:0000786
AAEL020629 AAEL020629-RA GO:0046982
AAEL011255 AAEL011255-RA GO:0005525
AAEL000004 AAEL000004-RA GO:0016021
AAEL000004 AAEL000004-RA GO:0016757
AAEL000004 AAEL000004-RA GO:0005789
AAEL000004 AAEL000004-RA GO:0006506
AAEL000004 AAEL000004-RA GO:0000030
AAEL003589 AAEL003589-RA NA
AAEL026354 AAEL026354-RA NA

Some genes have multiple GO_IDs (such as AAEL020629 and AAEL000004 in the example above). For each gene with multiple GO_IDs, I want to combine them all into a single row, separated by a comma and a space.

Below is my desired output:

Gene    GO_ID
AAEL025769      GO:0005525
AAEL020629      GO:0003677, GO:0005634, GO:0000786, GO:0046982
AAEL011255      GO:0005525
AAEL000004      GO:0016021, GO:0016757, GO:0005789, GO:0006506, GO:0000030
AAEL003589      NA
AAEL026354      NA

Any idea how I can do this? Thanks

You have already tagged awk, so you know the direction. You can now follow the path, and if you get stuck tell us where.
– Quasímodo, Commented May 27, 2020 at 21:58

steeldriver · Accepted Answer · 2020-05-27 22:52:21Z

With Miller:

$ mlr --pprint nest --implode --values --across-records --nested-fs ', ' -f GO_ID then cut -x -f stable file 
Gene       GO_ID
AAEL025769 GO:0005525
AAEL020629 GO:0003677, GO:0005634, GO:0000786, GO:0046982
AAEL011255 GO:0005525
AAEL000004 GO:0016021, GO:0016757, GO:0005789, GO:0006506, GO:0000030
AAEL003589 NA
AAEL026354 NA

Or (slightly simpler, but with less control over the output) with GNU Datamash:

$ datamash -HW groupby Gene collapse GO_ID < file
GroupBy(Gene)   collapse(GO_ID)
AAEL025769  GO:0005525
AAEL020629  GO:0003677,GO:0005634,GO:0000786,GO:0046982
AAEL011255  GO:0005525
AAEL000004  GO:0016021,GO:0016757,GO:0005789,GO:0006506,GO:0000030
AAEL003589  NA
AAEL026354  NA

Stalin Vignesh Kumar · Accepted Answer · 2020-05-28 10:38:11Z

Awk could help:

$ awk '{ a[$1]=a[$1]", "$3; }
END { for (i in a) { sub(/,/,"",a[i]);printf "%s %s\n",i,a[i] } }
' file
Gene  GO_ID
AAEL003589  NA
AAEL025769  GO:0005525
AAEL026354  NA
AAEL000004  GO:0016021, GO:0016757, GO:0005789, GO:0006506, GO:0000030
AAEL020629  GO:0003677, GO:0005634, GO:0000786, GO:0046982
AAEL011255  GO:0005525
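
Because `for (i in a)` visits array keys in an unspecified order, the rows above do not follow the input order, and the header line is not guaranteed to come out first. If the input order matters, a minimal sketch along the following lines may help. It is a variation on the same idea, not the original answer's code: the array names `order` and `go`, the temporary file, and the tab separator in the output are assumptions made for illustration.

```shell
# Sample input copied from the question (whitespace-separated, with a header line).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Gene stable GO_ID
AAEL025769 AAEL025769-RA GO:0005525
AAEL020629 AAEL020629-RA GO:0003677
AAEL020629 AAEL020629-RA GO:0005634
AAEL020629 AAEL020629-RA GO:0000786
AAEL020629 AAEL020629-RA GO:0046982
AAEL011255 AAEL011255-RA GO:0005525
AAEL000004 AAEL000004-RA GO:0016021
AAEL000004 AAEL000004-RA GO:0016757
AAEL000004 AAEL000004-RA GO:0005789
AAEL000004 AAEL000004-RA GO:0006506
AAEL000004 AAEL000004-RA GO:0000030
AAEL003589 AAEL003589-RA NA
AAEL026354 AAEL026354-RA NA
EOF

# order[] records each gene the first time it is seen; go[] accumulates its GO_IDs.
out=$(awk '
  # First row for this key (the header counts too): remember its position.
  !($1 in go) { order[++n] = $1; go[$1] = $3; next }
  # Later rows for the same key: append with a comma and a space.
  { go[$1] = go[$1] ", " $3 }
  # Print in first-seen order, key and joined values separated by a tab.
  END { for (i = 1; i <= n; i++) printf "%s\t%s\n", order[i], go[order[i]] }
' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

Since the header row is handled like any other row, `Gene` and `GO_ID` end up on the first output line without special-casing it. The sketch keeps every gene in memory until the end, which should be fine for files of this kind but is worth keeping in mind for very large inputs.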

edited May 28, 2020 at 10:38

answered May 28, 2020 at 9:51
