What linux commands can I use to sort columns in a tab-separated text file?

Question

I need to compare two versions of the same file. Both are tab-separated and have this form:

<filename1><tab><Marker11><tab><Marker12>...
<filename2><tab><Marker21><tab><Marker22><tab><Marker23>...

So each row has a different number of markers (the number varies between 1 and 10) and they all come from a small set of possible markers. So a file looks like this:

fileX<tab>Z<tab>M<tab>A
fileB<tab>Y
fileM<tab>M<tab>C<tab>B<tab>Y

What I need is:

1. Sort the file by rows
2. Sort the markers in each row so that they are in alphabetical order

So for the example above, the result would be

fileB<tab>Y
fileM<tab>B<tab>C<tab>M<tab>Y
fileX<tab>A<tab>M<tab>Z

It's easy to do #1 using sort but how do I do #2?

UPDATE: This is not a duplicate of the linked post, since my rows have different lengths and I need each row (the entries after the filename) sorted individually, i.e. the only column that keeps its position is the first one.

Possible duplicate of Using bash to sort data horizontally

– binduck, Commented Jul 13, 2017 at 16:53

RomanPerekhrest · Accepted Answer · 2017-07-13 17:14:09Z

awk solution:

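awk 'BEGIN{ FS=OFS="\t"; PROCINFO["sorted_in"]="@ind_str_asc" }
     { split($0,b,FS); delete b[1]; n=asort(b); r="";
         # asort() renumbers b as 1..n; loop numerically so that
         # index 10 is not visited before index 2
         for(i=1;i<=n;i++) r=(i>1)? r OFS b[i] : b[i]; a[$1] = r
     }
     END{ for(i in a) print i,a[i] }' file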

The output:

fileB   Y
fileM   B   C   M   Y
fileX   A   M   Z

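PROCINFO["sorted_in"]="@ind_str_asc" - makes for (i in array) loops visit the indices in ascending string order (GNU awk only)
split($0,b,FS); - split the line into array b on FS (the field separator); delete b[1] then drops the filename
asort(b) - sort the marker values and renumber them starting from 1 (GNU awk only)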
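
Both asort() and PROCINFO["sorted_in"] are GNU awk extensions. As a rough sketch for other awk implementations, the per-row sort can be done with a small insertion sort over fields 2..NF, leaving the row ordering to sort(1). The function name sort_rows_posix is made up for illustration, and the sketch assumes plain byte-order (C locale) sorting is acceptable.

```shell
#!/bin/sh
# Hypothetical helper, not part of the answer above: sort the markers in
# each row with POSIX awk, then sort the rows with sort(1).
sort_rows_posix() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        # Insertion sort of fields 2..NF; appending "" forces string comparison.
        for (i = 3; i <= NF; i++) {
            v = $i
            for (j = i - 1; j >= 2 && ($j "") > (v ""); j--)
                $(j + 1) = $j
            $(j + 1) = v
        }
        print
    }' "$1" | LC_ALL=C sort
}
```

Called as sort_rows_posix file, it writes the normalised rows to standard output. Unlike the array-based version, rows that share a filename are all kept rather than overwritten.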


Ed Morton · Accepted Answer · 2017-07-13 19:39:56Z

All you need is:

awk '
BEGIN { FS = OFS = "\t" }
{ for (i=2;i<=NF;i++) arr[$1][$i] }    # markers become subarray indices
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (i in arr) {
        printf "%s", i
        for (j in arr[i]) {
            printf "%s%s", OFS, j      # the marker is the index j, not a value
        }
        print ""
    }
}
' file

The above requires GNU awk (gawk) for true multi-dimensional arrays and for sorted_in.
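
Since the stated goal is to compare two versions of the file, the normalisation can be wrapped in a function and the two results diffed. The following is only a sketch that avoids awk altogether, using POSIX shell plus tr, sort and paste. The function name normalize and the file names old.tsv and new.tsv are hypothetical, and C-locale ordering is assumed.

```shell
#!/bin/sh
# Hypothetical helper: sort the markers within each row, then the rows.
normalize() {
    tab=$(printf '\t')
    while IFS= read -r line || [ -n "$line" ]; do
        name=${line%%"$tab"*}              # text before the first tab
        case $line in
            *"$tab"*) rest=${line#*"$tab"} ;;   # everything after the first tab
            *)        rest= ;;                  # row has no markers
        esac
        if [ -n "$rest" ]; then
            # one marker per line -> sort -> join back with tabs (paste default)
            sorted=$(printf '%s\n' "$rest" | tr '\t' '\n' | LC_ALL=C sort | paste -s -)
            printf '%s\t%s\n' "$name" "$sorted"
        else
            printf '%s\n' "$name"
        fi
    done < "$1" | LC_ALL=C sort
}

# Usage with hypothetical file names:
#   normalize old.tsv > old.norm
#   normalize new.tsv > new.norm
#   diff old.norm new.norm
```

This starts several processes per row, so it is likely to be much slower than an awk version on large files. Its main appeal is that it needs nothing beyond the standard utilities.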

edited Jul 13, 2017 at 19:39

answered Jul 13, 2017 at 19:32


4 Comments

hek2mgl Over a year ago

Good answer. It would be nice if some day (in xxx years :) ) predictable iteration over sorted arrays in awk became part of POSIX. I would just recommend explicitly using gawk instead of awk. (Which is also a kind of advertisement ;) )

hek2mgl Over a year ago

Actually, it should not break anything if sorted arrays were added under the hood. Python 3.7 did the same with the dict type. Code that assumes the array to be unsorted should still work.

Ed Morton Over a year ago

The problem with default sorted arrays is there is no order that's better than any other order (alphabetic? numeric? first in? incrementing? decrementing? etc.) so hash order is best as the default since it's most efficient.

hek2mgl Over a year ago

I see. The impressive performance of awk should definitely stay one of the major goals.
