Comparing 2 files using AWK with multiple parameters

Question

I have a problem while comparing 2 text files using awk. Here is what I want to do.

File1 contains a name in the first column, which has to match the name in the first column of File2. That's easy, so far so good. If it matches, I need to check whether the number in the 2nd column of File1 lies within the numeric range given by columns 2 and 3 of File2 (see example). If so, both matching lines should be printed as one line to a new file. I wrote something in awk, and the lines it outputs are correctly matched, but it misses the majority of them. Am I missing some kind of loop? Both files are sorted by the first column.

File1:

scaffold10|   300   T   C   0.9695   0.0000
scaffold10|   456   T   A   1.0000   0.0000
scaffold10|   470   C   A   0.9906   0.0000
scaffold10|   600   T   C   0.8423   0.0000
scaffold56|   5     A   C   0.8423   0.0000
scaffold56|   1000  C   T   0.8423   0.0000
scaffold56|   6000  C   C   0.7518   0.0000
scaffold7|    2     T   T   0.9046   0.0000
scaffold9|    300   T   T   0.9034   0.0000
scaffold9|    10900 T   G   0.9044   0.0000

File2:

scaffold10|   400   550   
scaffold10|   700   800    
scaffold56|   3     5000  
scaffold7|    55    200  
scaffold7|    214   567   
scaffold7|    656   800  
scaffold9|    234   675  
scaffold9|    699   1254 
scaffold9|    10887 11000

Output:

scaffold10|  456   T   A   1.0000   0.0000   scaffold10|  400   550
scaffold10|  470   C   A   0.9906   0.0000   scaffold10|  400   550
scaffold56|  5     A   C   0.8423   0.0000   scaffold56|  3     5000
scaffold56|  1000  C   T   0.8423   0.0000   scaffold56|  3     5000
scaffold9|   300   T   T   0.9034   0.0000   scaffold9|   234   675 
scaffold9|   10900 T   G   0.9044   0.0000   scaffold9|   10887 11000

My awk try:

awk -F "\t" ' FNR==NR {b[$1]=$0; c[$1]=$1; d[$1]=$2; e[$1]=$3; next} for {if (c[$1]==$1 && d[$1]<=$2 && e[$1]>=$2) {print b[$1]"\t"$0}}' File1 File2 > out.txt

How can I get the output I want using awk? Any suggestions are very welcome...

That awk script has a syntax error: the for isn't valid there. Beyond that, your assignments incorrectly collapse multiple rows of File1. You key your b, c, d, and e tables off of field $1, but that field repeats across lines, so you will only ever store the last line for a given value.
– Etan Reisner, Commented Aug 7, 2014 at 17:03
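
The overwriting described in this comment can be reproduced on a tiny sample. This is only an illustrative sketch: the three input lines are a cut-down version of File1, and the variable name stored is made up for the demonstration.

```shell
# Three sample lines; two of them share the key "scaffold10|".
stored=$(printf '%s\n' 'scaffold10| 300' 'scaffold10| 456' 'scaffold56| 5' |
    awk '{ b[$1] = $0 }                    # same key: a later line replaces the earlier one
         END { for (k in b) print b[k] }' | sort)

printf '%s\n' "$stored"
# Prints:
# scaffold10| 456
# scaffold56| 5
```

Three lines go in but only two come out, because b has one slot per distinct value of $1.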
Given your requirements, I imagine you might also find it easier to operate on the files the other way around: capture the ranges from File2 first, then compare the lines from File1 against them as you read them.
– Etan Reisner, Commented Aug 7, 2014 at 17:07
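
One way the suggestion in this comment might be written out is sketched below. It is not code from the commenter: the array names n, lo and hi, the temporary directory dir, and the whitespace-separated copies of the sample data are assumptions made for the sketch. It reads File2 first and keeps every range per name, then tests each File1 line against the ranges stored for its name.

```shell
dir=$(mktemp -d)    # scratch directory so no existing files are overwritten

# Sample data copied from the question (whitespace-separated).
cat > "$dir/File1" <<'EOF'
scaffold10|   300   T   C   0.9695   0.0000
scaffold10|   456   T   A   1.0000   0.0000
scaffold10|   470   C   A   0.9906   0.0000
scaffold10|   600   T   C   0.8423   0.0000
scaffold56|   5     A   C   0.8423   0.0000
scaffold56|   1000  C   T   0.8423   0.0000
scaffold56|   6000  C   C   0.7518   0.0000
scaffold7|    2     T   T   0.9046   0.0000
scaffold9|    300   T   T   0.9034   0.0000
scaffold9|    10900 T   G   0.9044   0.0000
EOF
cat > "$dir/File2" <<'EOF'
scaffold10|   400   550
scaffold10|   700   800
scaffold56|   3     5000
scaffold7|    55    200
scaffold7|    214   567
scaffold7|    656   800
scaffold9|    234   675
scaffold9|    699   1254
scaffold9|    10887 11000
EOF

awk '
    FNR == NR {                  # first file argument (File2): collect the ranges
        n[$1]++                  # how many ranges this name has so far
        lo[$1, n[$1]] = $2       # lower bound of this range
        hi[$1, n[$1]] = $3       # upper bound of this range
        next
    }
    {                            # second file argument (File1): test every stored range
        for (k = 1; k <= n[$1]; k++)
            if ($2 >= lo[$1, k] && $2 <= hi[$1, k])
                print $0, $1, lo[$1, k], hi[$1, k]
    }
' "$dir/File2" "$dir/File1" > "$dir/out.txt"

cat "$dir/out.txt"
```

On the sample data this should print the six matching lines from the expected output, although the column spacing of the appended range differs. Because the ranges are indexed by name plus a counter, duplicate names in File2 no longer overwrite each other.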

Ross Ridge · Accepted Answer · 2014-08-07 17:31:05Z

Use join to do a database-style join of the two files, and then use AWK to filter out the incorrect matches:

$ join file1 file2 | awk '$2 >= $7 && $2 <= $8'
scaffold10| 456 T A 1.0000 0.0000 400 550
scaffold10| 470 C A 0.9906 0.0000 400 550
scaffold56| 5 A C 0.8423 0.0000 3 5000
scaffold56| 1000 C T 0.8423 0.0000 3 5000
scaffold9| 300 T T 0.9034 0.0000 234 675
scaffold9| 10900 T G 0.9044 0.0000 10887 11000

Or, if you want the output formatted the same way as in the example you gave:

$ join file1 file2 | awk '$2 >= $7 && $2 <= $8 { printf("%-12s %-5s %-3s %-3s %-8s %-8s %-12s %-5s %-5s\n", $1, $2, $3, $4, $5, $6, $1, $7, $8); }'
scaffold10|  456   T   A   1.0000   0.0000   scaffold10|  400   550
scaffold10|  470   C   A   0.9906   0.0000   scaffold10|  400   550
scaffold56|  5     A   C   0.8423   0.0000   scaffold56|  3     5000
scaffold56|  1000  C   T   0.8423   0.0000   scaffold56|  3     5000
scaffold9|   300   T   T   0.9034   0.0000   scaffold9|   234   675
scaffold9|   10900 T   G   0.9044   0.0000   scaffold9|   10887 11000
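
To see why the filter compares $2 against $7 and $8, it can help to look at the rows join produces before awk filters them. The sketch below uses a trimmed-down subset of the question's data. The directory dir and the variables joined and kept are made up for the illustration, and LC_ALL=C is set only so that join's idea of sort order matches plain byte order.

```shell
dir=$(mktemp -d)    # scratch directory for the sample files

# A subset of the question's data; both files are sorted on column 1.
cat > "$dir/file1" <<'EOF'
scaffold10|   300   T   C   0.9695   0.0000
scaffold10|   456   T   A   1.0000   0.0000
scaffold56|   6000  C   C   0.7518   0.0000
EOF
cat > "$dir/file2" <<'EOF'
scaffold10|   400   550
scaffold10|   700   800
scaffold56|   3     5000
EOF

# join pairs every file1 line with every file2 line that has the same first field.
# Fields 1-6 then come from file1; fields 7 and 8 are the range from file2.
joined=$(LC_ALL=C join "$dir/file1" "$dir/file2")
printf '%s\n' "$joined"

# Keep only the pairings where the position falls inside the range.
kept=$(printf '%s\n' "$joined" | awk '$2 >= $7 && $2 <= $8')
printf '%s\n' "$kept"
```

With this subset, join should yield five pairings (2 x 2 for scaffold10| plus 1 x 1 for scaffold56|), of which only the 456 line paired with the range 400 to 550 passes the filter.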

edited Aug 7, 2014 at 17:31

answered Aug 7, 2014 at 17:20

1 Comment

bex Over a year ago

Thank you so much Ross. It worked. You helped me really a lot!! A really smart solution =)

Andre Wildberg · Answer · 2020-12-30 22:45:06Z

An awk solution that reads the first file argument (file2) into arrays and then compares each line of the second file (file1) against them on the fly.

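awk 'NR==FNR{ i++; x[i]=$0; x_0[i]=$1; x_1[i]=$2; x_2[i]=$3 }  # file2: keep line, name and range bounds
     NR!=FNR{ for(j=1;j<=i;j++){
                # compare names exactly; a regex match would treat "|" as alternation
                # <= and >= keep positions that sit exactly on a range boundary
                if( $1==x_0[j] && x_1[j]<=$2 && x_2[j]>=$2 ){
                  print $0,x[j]
                }
              }
}' file2 file1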

# scaffold10|   456   T   A   1.0000   0.0000 scaffold10|   400   550   
# scaffold10|   470   C   A   0.9906   0.0000 scaffold10|   400   550   
# scaffold56|   5     A   C   0.8423   0.0000 scaffold56|   3     5000  
# scaffold56|   1000  C   T   0.8423   0.0000 scaffold56|   3     5000  
# scaffold9|    300   T   T   0.9034   0.0000 scaffold9|    234   675  
# scaffold9|    10900 T   G   0.9044   0.0000 scaffold9|    10887 11000
