I need to compare two versions of the same file. Both are tab-separated and have this form:

<filename1><tab><Marker11><tab><Marker12>...
<filename2><tab><Marker21><tab><Marker22><tab><Marker23>...

Each row has a different number of markers (between 1 and 10), and they all come from a small set of possible markers. A file looks like this:

fileX<tab>Z<tab>M<tab>A
fileB<tab>Y
fileM<tab>M<tab>C<tab>B<tab>Y

What I need is:

  1. Sort the rows of the file by filename
  2. Sort the markers within each row into alphabetical order

So for the example above, the result would be

fileB<tab>Y
fileM<tab>B<tab>C<tab>M<tab>Y
fileX<tab>A<tab>M<tab>Z

It's easy to do #1 using sort but how do I do #2?

UPDATE: This is not a duplicate of this post, since my rows have different lengths and I need the entries after the filename sorted individually for each row, i.e. the only column that stays in place is the first one.

2 Answers

awk solution:

awk 'BEGIN{ FS=OFS="\t"; PROCINFO["sorted_in"]="@ind_str_asc" }
     { split($0,b,FS); delete b[1]; n=asort(b); r="";
         # numeric loop: a for-in loop would visit index 10 before index 2
         for(i=1;i<=n;i++) r=(r!="")? r OFS b[i] : b[i]; a[$1] = r
     }
     END{ for(i in a) print i,a[i] }' file

The output:

fileB   Y
fileM   B   C   M   Y
fileX   A   M   Z

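  • PROCINFO["sorted_in"]="@ind_str_asc" - GNU awk setting that makes for (i in ...) loops visit array indices in ascending string order, so the END loop prints the rows sorted by filename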

  • split($0,b,FS); - split the line into array b by FS (field separator)

  • asort(b) - sort marker values
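
One detail about "@ind_str_asc" is worth checking before relying on it: it compares indices as strings. That matters whenever an array with numeric indices, such as the one asort() produces, is traversed with for (i in ...), because index 10 then sorts before index 2. A minimal sketch of the difference, assuming GNU awk is installed as gawk (demo_index_order and the sample markers are made up for illustration):

```shell
# Sketch, assuming GNU awk is installed as "gawk"; demo_index_order is a made-up name.
demo_index_order() {
    gawk '
    # Join the values of arr in the current traversal order.
    function show(arr,    i, s) {
        s = ""
        for (i in arr)
            s = (s == "" ? "" : s " ") arr[i]
        return s
    }
    BEGIN {
        split("J I H G F E D C B A", b, " ")
        asort(b)                                  # b[1]="A" ... b[10]="J"
        PROCINFO["sorted_in"] = "@ind_str_asc"    # indices compared as strings
        print "string order: " show(b)
        PROCINFO["sorted_in"] = "@ind_num_asc"    # indices compared as numbers
        print "numeric order: " show(b)
    }'
}

demo_index_order
```

With ten markers the first line should show J directly after A, while the second line stays in alphabetical order. Rows with nine or fewer markers are not affected.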

All you need is:

awk '
BEGIN { FS = OFS = "\t" }
# store each marker as an index of the per-filename subarray
{ for (i=2;i<=NF;i++) arr[$1][$i] }
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (i in arr) {
        printf "%s", i
        for (j in arr[i]) {
            printf "%s%s", OFS, j
        }
        print ""
    }
}
' file

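The above uses GNU awk for true multi-dimensional arrays plus sorted_in. Because the markers are stored as array indices, a marker that is repeated within a row is printed only once.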
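
If GNU awk is not available, the same result can be approximated with a hand-written insertion sort over fields 2..NF, leaving the row order to sort. This is only a sketch under stated assumptions: the input is tab-separated, markers are compared as strings, and sort_markers is a hypothetical helper name. Unlike the index-based approach, it keeps repeated markers and does not merge rows that share a filename.

```shell
# Sketch only: per-row marker sort without gawk extensions.
# sort_markers is a hypothetical helper name; input is assumed tab-separated.
sort_markers() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        # Insertion sort over fields 2..NF. Rows have at most about 10
        # markers, so the quadratic cost is negligible.
        for (i = 3; i <= NF; i++) {
            v = $i ""                               # force string comparison
            for (j = i - 1; j >= 2 && ($j "") > v; j--)
                $(j + 1) = $j                       # shift larger markers right
            $(j + 1) = v
        }
        print
    }' "$@" | LC_ALL=C sort                         # then order the rows
}

# Demo with the sample rows from the question.
printf 'fileX\tZ\tM\tA\nfileB\tY\nfileM\tM\tC\tB\tY\n' | sort_markers
```

For a real file the call would be sort_markers file. LC_ALL=C is used so that the row order does not depend on the current locale.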

4 Comments

Good answer. It would be nice if some day (in xxx years :) ) predictable iteration over sorted arrays in awk became part of POSIX. I would just recommend explicitly using gawk instead of awk. (Which is also a kind of advertisement ;) )
Actually, it should not break anything if sorted arrays get added under the hood. Python 3.7 did the same with the dict type. Code that assumes the array to be unsorted should still work.
The problem with default sorted arrays is there is no order that's better than any other order (alphabetic? numeric? first in? incrementing? decrementing? etc.) so hash order is best as the default since it's most efficient.
I see. The impressive performance of awk should definitely stay one of the major goals.
