I have a whitespace-delimited file with three columns:

Gene stable GO_ID
AAEL025769 AAEL025769-RA GO:0005525
AAEL020629 AAEL020629-RA GO:0003677
AAEL020629 AAEL020629-RA GO:0005634
AAEL020629 AAEL020629-RA GO:0000786
AAEL020629 AAEL020629-RA GO:0046982
AAEL011255 AAEL011255-RA GO:0005525
AAEL000004 AAEL000004-RA GO:0016021
AAEL000004 AAEL000004-RA GO:0016757
AAEL000004 AAEL000004-RA GO:0005789
AAEL000004 AAEL000004-RA GO:0006506
AAEL000004 AAEL000004-RA GO:0000030
AAEL003589 AAEL003589-RA NA
AAEL026354 AAEL026354-RA NA

Some genes have multiple GO_IDs (such as AAEL020629 and AAEL000004 in the example above). For each gene with multiple GO_IDs, I want to combine them into a single row, separated by a comma and a space.

Below is my desired output:

Gene    GO_ID
AAEL025769      GO:0005525
AAEL020629      GO:0003677, GO:0005634, GO:0000786, GO:0046982
AAEL011255      GO:0005525
AAEL000004      GO:0016021, GO:0016757, GO:0005789, GO:0006506, GO:0000030
AAEL003589      NA
AAEL026354      NA

Any idea how I can do this? Thanks

  • You have already tagged awk, so you know the direction. You can now follow the path, and if you get stuck tell us where. Commented May 27, 2020 at 21:58
  • I guess it can be solved by awk! Commented May 27, 2020 at 22:22

2 Answers

With Miller:

$ mlr --pprint nest --implode --values --across-records --nested-fs ', ' -f GO_ID then cut -x -f stable file 
Gene       GO_ID
AAEL025769 GO:0005525
AAEL020629 GO:0003677, GO:0005634, GO:0000786, GO:0046982
AAEL011255 GO:0005525
AAEL000004 GO:0016021, GO:0016757, GO:0005789, GO:0006506, GO:0000030
AAEL003589 NA
AAEL026354 NA

Or, slightly simpler but with less control over the output format, with GNU Datamash:

$ datamash -HW groupby Gene collapse GO_ID < file
GroupBy(Gene)   collapse(GO_ID)
AAEL025769  GO:0005525
AAEL020629  GO:0003677,GO:0005634,GO:0000786,GO:0046982
AAEL011255  GO:0005525
AAEL000004  GO:0016021,GO:0016757,GO:0005789,GO:0006506,GO:0000030
AAEL003589  NA
AAEL026354  NA

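Awk could help (the for (i in a) loop visits genes in an unspecified order, so the rows need not follow the input order):

$ awk '
# Append every GO_ID to its gene entry, each prefixed with ", "
{ a[$1] = a[$1] ", " $3 }
# Strip the leading comma from each entry, then print gene and GO_IDs
END { for (i in a) { sub(/,/, "", a[i]); printf "%s %s\n", i, a[i] } }
' file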
Gene  GO_ID
AAEL003589  NA
AAEL025769  GO:0005525
AAEL026354  NA
AAEL000004  GO:0016021, GO:0016757, GO:0005789, GO:0006506, GO:0000030
AAEL020629  GO:0003677, GO:0005634, GO:0000786, GO:0046982
AAEL011255  GO:0005525
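
If the rows should come out in the same order as the input, as in the desired output in the question, one possible variant is to record each gene the first time it is seen and print from that list. This is only a sketch: the function name collapse_go and the array names go and order are made up for illustration, and it assumes the gene is in column 1 and the GO_ID in column 3, as in the sample file.

```shell
# collapse_go is a hypothetical helper: it reads the three-column file
# (from the named file, or from stdin if none is given) and prints one
# tab-separated row per gene, in order of first appearance.
collapse_go() {
    awk '
        # First time this gene is seen: remember its position, start its list
        !($1 in go) { order[++n] = $1; go[$1] = $3; next }
        # Gene seen before: append the GO_ID with a comma and a space
        { go[$1] = go[$1] ", " $3 }
        # Print genes in the order they first appeared
        END { for (i = 1; i <= n; i++) printf "%s\t%s\n", order[i], go[order[i]] }
    ' "$@"
}

# Small demo on a few lines taken from the question
collapse_go <<'EOF'
Gene stable GO_ID
AAEL025769 AAEL025769-RA GO:0005525
AAEL020629 AAEL020629-RA GO:0003677
AAEL020629 AAEL020629-RA GO:0005634
AAEL003589 AAEL003589-RA NA
EOF
```

On the real data this would be called as collapse_go file. The header line needs no special handling, because "Gene" is treated like any other key and, being first, is printed first. Because the list starts with the first GO_ID rather than with a separator, there is no leading comma to strip afterwards.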
