
I have a problem while comparing 2 text files using awk. Here is what I want to do.

File1 contains a name in the first column, which has to match the name in the first column of File2. That part is easy. If the names match, I need to check whether the number in the 2nd column of File1 lies within the numeric range given by columns 2 and 3 of File2 (see example). If it does, both matching lines should be printed as one line to a new file. I wrote something in awk, and the assignments it outputs are correct, but it misses the majority of the matches. Am I missing some kind of loop? Both files are sorted by the first column.

File1:

scaffold10|   300   T   C   0.9695   0.0000
scaffold10|   456   T   A   1.0000   0.0000
scaffold10|   470   C   A   0.9906   0.0000
scaffold10|   600   T   C   0.8423   0.0000
scaffold56|   5     A   C   0.8423   0.0000
scaffold56|   1000  C   T   0.8423   0.0000
scaffold56|   6000  C   C   0.7518   0.0000
scaffold7|    2     T   T   0.9046   0.0000
scaffold9|    300   T   T   0.9034   0.0000
scaffold9|    10900 T   G   0.9044   0.0000

File2:

scaffold10|   400   550   
scaffold10|   700   800    
scaffold56|   3     5000  
scaffold7|    55    200  
scaffold7|    214   567   
scaffold7|    656   800  
scaffold9|    234   675  
scaffold9|    699   1254 
scaffold9|    10887 11000   

Output:

scaffold10|  456   T   A   1.0000   0.0000   scaffold10|  400   550
scaffold10|  470   C   A   0.9906   0.0000   scaffold10|  400   550
scaffold56|  5     A   C   0.8423   0.0000   scaffold56|  3     5000
scaffold56|  1000  C   T   0.8423   0.0000   scaffold56|  3     5000
scaffold9|   300   T   T   0.9034   0.0000   scaffold9|   234   675 
scaffold9|   10900 T   G   0.9044   0.0000   scaffold9|   10887 11000 

My awk try:

awk -F "\t" ' FNR==NR {b[$1]=$0; c[$1]=$1; d[$1]=$2; e[$1]=$3; next} for {if (c[$1]==$1 && d[$1]<=$2 && e[$1]>=$2) {print b[$1]"\t"$0}}' File1 File2 > out.txt

How can I get the output I want using awk? Any suggestions are very welcome...

  • That awk script has a syntax error: the for isn't valid there. That being said, your assignments also collapse multiple rows of File1 incorrectly. You key your b, c, d, and e tables off of field $1, but that field is duplicated across lines, so you will only ever store the last line for a given value. Commented Aug 7, 2014 at 17:03
  • Given your requirements, I imagine you might also find it easier to operate on the files the other way around. That is, capture the ranges first and then compare the lines from File1 against them as you see them. Commented Aug 7, 2014 at 17:07
  • Thank you Etan for pointing me to my mistakes. Commented Aug 8, 2014 at 9:25
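
The suggestion in the comments above (read the ranges first, and keep every range for a name instead of only the last one) could be sketched roughly as follows. This is a minimal sketch, not the commenter's own code: it assumes whitespace-separated fields, uses a shortened copy of the sample data written to a temporary directory, and the array names cnt, lo, hi and line are made up for illustration.

```shell
#!/bin/sh
# Sample data (a subset of the question's files) in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/File1" <<'EOF'
scaffold10|   300   T   C   0.9695   0.0000
scaffold10|   456   T   A   1.0000   0.0000
scaffold7|    2     T   T   0.9046   0.0000
scaffold9|    10900 T   G   0.9044   0.0000
EOF
cat > "$tmp/File2" <<'EOF'
scaffold10|   400   550
scaffold10|   700   800
scaffold7|    55    200
scaffold9|    234   675
scaffold9|    10887 11000
EOF

out=$(awk '
  # First file on the command line (File2): store every range per name.
  NR == FNR {
    k = ++cnt[$1]                 # number of ranges seen so far for this name
    lo[$1, k] = $2 + 0            # "+ 0" forces a numeric value
    hi[$1, k] = $3 + 0
    line[$1, k] = $0
    next
  }
  # Second file (File1): test the position against each range of the same name.
  {
    pos = $2 + 0
    for (k = 1; k <= cnt[$1]; k++)
      if (pos >= lo[$1, k] && pos <= hi[$1, k])
        print $0 "\t" line[$1, k]
  }
' "$tmp/File2" "$tmp/File1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Because the ranges are grouped by name, each File1 line is only compared against ranges for its own scaffold, and neither file needs to be sorted.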

2 Answers


Use join to do a database style join of the two files and then use AWK to filter out the incorrect matches:

$ join file1 file2 | awk '$2 >= $7 && $2 <= $8'
scaffold10| 456 T A 1.0000 0.0000 400 550
scaffold10| 470 C A 0.9906 0.0000 400 550
scaffold56| 5 A C 0.8423 0.0000 3 5000
scaffold56| 1000 C T 0.8423 0.0000 3 5000
scaffold9| 300 T T 0.9034 0.0000 234 675
scaffold9| 10900 T G 0.9044 0.0000 10887 11000

Or, if you want the output formatted the same way as in the example you gave:

$ join file1 file2 | awk '$2 >= $7 && $2 <= $8 { printf("%-12s %-5s %-3s %-3s %-8s %-8s %-12s %-5s %-5s\n", $1, $2, $3, $4, $5, $6, $1, $7, $8); }'
scaffold10|  456   T   A   1.0000   0.0000   scaffold10|  400   550
scaffold10|  470   C   A   0.9906   0.0000   scaffold10|  400   550
scaffold56|  5     A   C   0.8423   0.0000   scaffold56|  3     5000
scaffold56|  1000  C   T   0.8423   0.0000   scaffold56|  3     5000
scaffold9|   300   T   T   0.9034   0.0000   scaffold9|   234   675
scaffold9|   10900 T   G   0.9044   0.0000   scaffold9|   10887 11000
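
join expects both inputs to be sorted on the join field. The question states that they are; if they were not, one option would be to sort them first. A minimal sketch, assuming whitespace-separated fields, a sort that supports the stable -s flag (GNU and BSD sort do), and made-up temporary file names:

```shell
#!/bin/sh
# Deliberately unsorted sample data in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
scaffold9|    300   T   T   0.9034   0.0000
scaffold10|   456   T   A   1.0000   0.0000
scaffold10|   600   T   C   0.8423   0.0000
EOF
cat > "$tmp/file2" <<'EOF'
scaffold9|    234   675
scaffold10|   400   550
scaffold10|   700   800
EOF

# Sort on the first field only; -s keeps the original line order within a name.
# LC_ALL=C makes sort and join agree on the collation order.
LC_ALL=C sort -s -k1,1 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -s -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# Join on field 1, then keep rows whose position lies inside the range.
out=$(LC_ALL=C join "$tmp/file1.sorted" "$tmp/file2.sorted" |
      awk '$2 >= $7 && $2 <= $8')

printf '%s\n' "$out"
rm -rf "$tmp"
```

Note that join pairs every file1 line with every file2 line of the same name before awk filters them, so the intermediate output can be much larger than the final result when a name has many ranges.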

1 Comment

Thank you so much Ross. It worked. You helped me really a lot!! A really smart solution =)

An awk solution that reads the first file given on the command line (file2, the ranges) into arrays and then compares it on the fly with the content of the other file (file1).

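awk 'NR==FNR { n++; line[n]=$0; name[n]=$1; lo[n]=$2; hi[n]=$3; next }  # file2: store each range
     { for (j=1; j<=n; j++) {
         # exact name match (not a regex match) and inclusive bounds, as in the question
         if ($1==name[j] && lo[j]<=$2 && $2<=hi[j]) {
           print $0, line[j]
         }
       }
}' file2 file1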

# scaffold10|   456   T   A   1.0000   0.0000 scaffold10|   400   550   
# scaffold10|   470   C   A   0.9906   0.0000 scaffold10|   400   550   
# scaffold56|   5     A   C   0.8423   0.0000 scaffold56|   3     5000  
# scaffold56|   1000  C   T   0.8423   0.0000 scaffold56|   3     5000  
# scaffold9|    300   T   T   0.9034   0.0000 scaffold9|    234   675  
# scaffold9|    10900 T   G   0.9044   0.0000 scaffold9|    10887 11000
