I have what I think should be a common problem, but I haven't found a good solution for it yet.

I have a file where each line has a chromosome number, a starting position in the chromosome, and some related values, as shown below.

1       1.07299851019   1       1.07299851019   HQ      chrY    2845223         +       0.251366120219  46      
1       1.06860686763   1       1.06860686763   HQ      chr10   88595309        +       0.256830601093  47      
1       1.04688316093   3       3.14064948278   HQ      chr6    49126474        +       0.295081967213  54      
1       1.1563829915    1       1.1563829915    HQ      chrX    16428176        +       0.185792349727  34      

I want to sort this file with the Unix sort command on both chromosome (column 6) and starting position (column 7). After searching around, I came up with this, which got me fairly close:

nohup sort -t $'\t' -k 6.4,6.5n -k 7,7n   

The remaining problem that I can't solve is that, while the numbered chromosomes are sorted correctly, chromosome X and chromosome Y are interleaved and sorted together by starting position, like this:

1       0.978579587641  9       8.80721628876   HQ      chrX    2861057 -       0.431693989071  79      
1       0.979500536702  1       0.979500536702  HQ      chrY    2861314 -       0.420765027322  77      
1       0.969979601694  9       8.72981641525   HQ      chrX    2861649 -       0.469945355191  86   

I know it would be possible to solve e.g. by replacing chrX and chrY with numbers, or write a program to solve it, but it would be super nice to be able to use a simple command, especially since the file sizes often are huge and I do this repeatedly.

It would also be nice if the chromosomes line up in order 1 to 22 and then X and then Y. My command had chromosome X and Y coming first and then chromosome 1 to 22.

3 Answers

To separate X from Y, you can specify a fallback key:

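nohup sort -t $'\t' -k 6.4,6.5n -k 6,6 -k 7,7n

(This says that if two rows are equivalent in the field 6.4,6.5 as compared numerically, the next step is to compare them on field 6 as plain text, before trying field 7. The key is written as 6,6 rather than 6 because a key with no end field runs to the end of the line, which would compare the positions as text before field 7 is ever consulted.)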

Disclaimer: this doesn't satisfy the goal in your last paragraph:

It would also be nice if the chromosomes line up in order 1 to 22 and then X and then Y. My command had chromosome X and Y coming first and then chromosome 1 to 22.

because X and Y will still be treated as zero during the numeric sort, and the fallback won't change that. Hopefully you find it useful anyway.
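As a quick check of the fallback key, here is a minimal sketch. The function name `sort_chr_pos` and the variable `sample` are made up for illustration; the rows are taken from the question. It uses `LC_ALL=C` so the text comparison on field 6 is plain byte order, and `-k 6,6` so that key stops at the end of field 6.

```shell
# Hypothetical helper: sort by chromosome (field 6), then start position (field 7).
sort_chr_pos() {
  # Key 1: characters 4-5 of field 6, numeric (chrX and chrY both count as 0).
  # Key 2: field 6 as plain text, which separates chrX from chrY.
  # Key 3: field 7, numeric start position.
  LC_ALL=C sort -t "$(printf '\t')" -k 6.4,6.5n -k 6,6 -k 7,7n "$@"
}

# Four tab-separated rows from the question, deliberately out of order.
sample=$(printf '%s\n' \
  "$(printf '1\t0.969979601694\t9\t8.72981641525\tHQ\tchrX\t2861649\t-\t0.469945355191\t86')" \
  "$(printf '1\t0.979500536702\t1\t0.979500536702\tHQ\tchrY\t2861314\t-\t0.420765027322\t77')" \
  "$(printf '1\t0.978579587641\t9\t8.80721628876\tHQ\tchrX\t2861057\t-\t0.431693989071\t79')" \
  "$(printf '1\t1.06860686763\t1\t1.06860686763\tHQ\tchr10\t88595309\t+\t0.256830601093\t47')")

# Show only the chromosome and position columns of the sorted result.
printf '%s\n' "$sample" | sort_chr_pos | cut -f 6,7
```

With this sample, both chrX rows should come out together and ahead of chrY, with chr10 last, which matches the caveat above that X and Y still sort before the numbered chromosomes.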

I know it would be possible to solve e.g. by replacing chrX and chrY with numbers, […]

Indeed, you can do that replacement on the fly:

sed 's/chrX/chr23/; s/chrY/chr24/' |
  sort -t $'\t' -k 6.4,6.5n -k 7,7n |
  sed 's/chr23/chrX/; s/chr24/chrY/'

(Note that the line breaks in this command are optional; I included them for readability, but you can put it all on one line if you prefer.)
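A variation on the same idea, sketched here under some assumptions, leaves the data untouched by prepending a temporary numeric rank column, sorting on it, and cutting it off again. The function name `sort_by_chromosome` and the variable `rows` are hypothetical, and the ranks 23 and 24 for X and Y assume human chromosomes named `chr1` to `chr22`, `chrX`, and `chrY`; any other name (for example a mitochondrial contig) would get rank 0 and sort first.

```shell
# Hypothetical helper: decorate each line with a rank, sort, then undecorate.
sort_by_chromosome() {
  awk -F '\t' -v OFS='\t' '
    {
      name = $6
      sub(/^chr/, "", name)            # "chr10" -> "10", "chrX" -> "X"
      if (name == "X")      rank = 23  # assumed rank for X
      else if (name == "Y") rank = 24  # assumed rank for Y
      else                  rank = name + 0
      print rank, $0                   # rank becomes a temporary first column
    }' "$@" |
    # Field 1 is the rank; the start position has shifted from field 7 to field 8.
    LC_ALL=C sort -t "$(printf '\t')" -k 1,1n -k 8,8n |
    cut -f 2-                          # drop the temporary rank column
}

# The four tab-separated rows from the question.
rows=$(printf '%s\n' \
  "$(printf '1\t1.07299851019\t1\t1.07299851019\tHQ\tchrY\t2845223\t+\t0.251366120219\t46')" \
  "$(printf '1\t1.06860686763\t1\t1.06860686763\tHQ\tchr10\t88595309\t+\t0.256830601093\t47')" \
  "$(printf '1\t1.04688316093\t3\t3.14064948278\tHQ\tchr6\t49126474\t+\t0.295081967213\t54')" \
  "$(printf '1\t1.1563829915\t1\t1.1563829915\tHQ\tchrX\t16428176\t+\t0.185792349727\t34')")

printf '%s\n' "$rows" | sort_by_chromosome | cut -f 6,7
```

Because the rank lives in its own column, a string such as `chr23` appearing elsewhere in a line cannot be rewritten by accident. The function also accepts a file name in place of standard input.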


1 Comment

Thank you very much ruakh, the sed command does exactly what I want!

If your version of sort supports the -V option (version sort, a GNU extension that orders numbers embedded in text naturally), you can do something like:

$ cat file
1   1.07299851019   1   1.07299851019   HQ  chrY    2845223     +   0.251366120219  46
1   1.06860686763   1   1.06860686763   HQ  chr10   88595309    +   0.256830601093  47
1   1.04688316093   3   3.14064948278   HQ  chr6    49126474    +   0.295081967213  54
1   1.1563829915    1   1.1563829915    HQ  chrX    16428176    +   0.185792349727  34

$ sort -t$'\t' -k6,6V -k7,7n file
1   1.04688316093   3   3.14064948278   HQ  chr6    49126474    +   0.295081967213  54
1   1.06860686763   1   1.06860686763   HQ  chr10   88595309    +   0.256830601093  47
1   1.1563829915    1   1.1563829915    HQ  chrX    16428176    +   0.185792349727  34
1   1.07299851019   1   1.07299851019   HQ  chrY    2845223     +   0.251366120219  46

1 Comment

Thank you JS, this seems to be what I want. Unfortunately my version doesn't support the -V option.

Elaborating on jaypal's answer from before...

You can change the sort criteria per column like so:

sort -k1,1V input.txt

This sorts on column 1, and only column 1, using the -V option mentioned above. To quote one description of it:

What -V means is “natural sort of (version) numbers within text” (type man sort to find out), and it magically orders numbers and texts.

If you have multiple columns in a tab delimited file and you want to specify the primary column sort order you can do something like the following:

sort -k14,14V -k1,1n input.txt

The above uses column 14 as the primary sort key with -V ordering, then column 1 as the secondary sort key with numeric ordering. (This can be useful for sorting by chromosome and then position.)

To address the missing -V option for OSX users:

The Mac OS X native sort does not support -V, you’ll have to install GNU core utilities and use gsort instead.
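If a script has to run on machines with and without a -V-capable sort, one option is to probe for it first. The sketch below is only an illustration: the function name `pick_version_sort` is made up, and it assumes that GNU sort, where installed separately, is available under the name `gsort` as described above.

```shell
# Hypothetical helper: print the name of the first sort command that handles -V.
pick_version_sort() {
  for candidate in sort gsort; do
    # Skip names that are not installed.
    command -v "$candidate" >/dev/null 2>&1 || continue
    # A working -V must place chr2 before chr10.
    first=$(printf 'chr10\nchr2\n' | "$candidate" -V 2>/dev/null | head -n 1)
    if [ "$first" = "chr2" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

if SORT_CMD=$(pick_version_sort); then
  printf 'chr21\nchr2\nchrY\nchr1\nchr10\nchrX\n' | "$SORT_CMD" -V
else
  echo "no sort command with -V support found" >&2
fi
```

The probe checks the actual ordering rather than just the exit status, so a sort that accepts -V but treats it differently would not be selected.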

For a quick look at how -V sorting will work you can see the below example...

Example input:

chr21   
chr2    
chr3    
chrY    
chr1    
chr3    
chr10   
chrX    

-V sorted output:

chr1    
chr2    
chr3    
chr3    
chr10   
chr21   
chrX    
chrY    
