
Let's say I have a tab-delimited file, lookup.txt:

070-031 070-291 030-031
1   2   X
2   3   1
3   4   2
4   5   3
5   6   4
6   7   5
7   8   6
8   9   7

And I have the following files containing the values to look up:

$cat 030-031.txt
Line1   070-291 4
Line2   070-031 3

$cat 070-031.txt
Line1   030-031 5
Line2   070-291 8

I would like script.awk to return

$script.awk 030-031.txt lookup.txt
Line1   070-291 4   2
Line2   070-031 3   2

and

$script.awk 070-031.txt lookup.txt
Line1   030-031 5   6
Line2   070-291 8   7

The only approach I can think of is to create a separate, expanded lookup file for each data file, e.g.

$cat lookup_030-031.txt
070-031:1   X
070-031:2   1
070-031:3   2
070-031:4   3
070-031:5   4
070-031:6   5
070-031:7   6
070-031:8   7
070-291:2   X
070-291:3   1
070-291:4   2
070-291:5   3
070-291:6   4
070-291:7   5
070-291:8   6
070-291:9   7

and then

awk 'NR==FNR { a[$1]=$2;next}{print $0,a[$2":"$3]}' lookup_030-031.txt 030-031.txt

This works, but I have many more columns and approximately 10000 rows, so I'd rather not have to generate a lookup file for each. Many thanks

AMENDED

Glenn Jackman's answer is a perfect solution to the initial question and his second answer is more efficient. However, I forgot to stipulate that the script should handle duplicates. For instance, it should be able to handle

$cat 030-031.txt
070-031 3
070-031 6

and return BOTH corresponding numbers for the respective file (2 and 5 respectively). Only Glenn's first answer handles repeated lookups; his second returns only the last value found.

  • I don't understand what algorithm you're using to generate the desired output. Please elaborate. Commented May 8, 2014 at 20:56
  • Certainly. In the second file that I pass to my script, field $2 contains the name of the column that I wish to look up from. The output column is defined by the file name of the second file, so for Line1 of 030-031.txt (Line1 070-291 4) I wish to look up "4" in column 070-291 and return the corresponding value in column 030-031, which is "2". Many thanks Commented May 8, 2014 at 21:17

2 Answers

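OK, I see now. You have to read the lookup file into one big data structure; after that, looking up values for the individual files is easy. (The script reads the column name from $1 and the value from $2, so it expects two-field data lines like those in the sample output.)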

$ cat script.awk 
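BEGIN {OFS = "\t"}
# Header line of lookup.txt: remember each column's name.
NR==1 {
    for (i=1; i<=NF; i++)
        label[i] = $i
    next
}
# Remaining lines of lookup.txt: for every ordered pair of columns, map
# (source column, source value, destination column) -> destination value.
NR==FNR {
    for (i=1; i<=NF; i++)
        for (j=1; j<=NF; j++)
            if (i != j)
                value[label[i],$i,label[j]] = $j
    next
}
# First line of a data file: the destination column is the file name
# without its extension.
FNR==1 {
    split(FILENAME, a, /\./)
    j = a[1]
}
# Every data line ($1 = column name, $2 = value): append the looked-up value.
{
    $(NF+1) = value[$1,$2,j]
    print
}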

$ awk -f script.awk lookup.txt 030-031.txt
070-291 4   2
070-031 3   2

$ awk -f script.awk lookup.txt 070-031.txt 
030-031 5   6
070-291 8   7

This version is a bit more compact, and passes the filenames in your preferred order:

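$ cat script.awk
BEGIN {OFS = "\t"}

# First line of the data file: the destination column is the file name
# without its extension.
NR==1 {
    split(FILENAME, a, /\./)
    dest = a[1]
}
# Data file: remember the value to look up for each column name
# (a repeated column name overwrites the earlier entry).
NR==FNR {
    src[$1]=$2
    next
}
# Header line of lookup.txt: map each column name to its field number.
FNR==1 {
    for (i=1; i<=NF; i++)
        col[$i]=i
    next
}

# Remaining lines of lookup.txt: print the destination value for every match.
{
    for (from in src)
        if ($(col[from]) == src[from])
            print from, src[from], $(col[dest])
}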

$ awk -f script.awk  030-031.txt   lookup.txt 
070-031 3   2
070-291 4   2

$ awk -f script.awk  070-031.txt  lookup.txt 
030-031 5   6
070-291 8   7
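
To take the data file first and still cope with repeated column names, one option is to store every query line in order and fill in the answers while streaming lookup.txt. The following is only a minimal sketch of that idea, wrapped in a shell function for convenience. The function name lookup_dups is hypothetical, and the sketch assumes that the column name and the value are the last two fields of each data line (so both the two-field layout and a layout with a leading Line1 label should work) and that the data file is not empty.

```shell
# lookup_dups DATAFILE LOOKUPFILE   (hypothetical helper name)
lookup_dups() {
    awk '
    BEGIN { OFS = "\t" }

    # Destination column = data file name without directory or extension.
    NR == 1 {
        dest = FILENAME
        sub(/^.*\//, "", dest)
        sub(/\.[^.]*$/, "", dest)
    }

    # Data file: keep every query line in order, duplicates included.
    NR == FNR {
        if (NF >= 2) {
            n++
            line[n] = $0
            key[n]  = $(NF - 1) SUBSEP $NF
            want[key[n]]        # (column, value) pairs that are needed
            used[$(NF - 1)]     # columns that are queried at all
        }
        next
    }

    # Lookup header: map column name -> field number.
    FNR == 1 {
        for (i = 1; i <= NF; i++) col[$i] = i
        d = (dest in col) ? col[dest] : 0
        next
    }

    # Lookup rows: record the destination value for each wanted pair.
    d {
        for (c in used)
            if (c in col) {
                k = c SUBSEP $(col[c])
                if (k in want) found[k] = $d
            }
    }

    # Answer the queries in their original order.
    END {
        for (i = 1; i <= n; i++)
            print line[i], ((key[i] in found) ? found[key[i]] : "")
    }
    ' "$1" "$2"
}
```

Called as lookup_dups 030-031.txt lookup.txt, this appends the looked-up value to each input line. A query with no match gets an empty extra field, and memory use grows with the size of the data file, not with the square of the number of lookup columns.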

5 Comments

And if you wanted to get efficient, add alias lookup='awk -f script.awk lookup.txt' so you can use lookup 070-031.txt
Awksome, thank you. I seem to have reached an impasse with my understanding of two-dimensional array structures. Back to the books, I think!
The efficient shorthand for the 2nd version is lookup() { awk -f script.awk "$1" lookup.txt; }
Glenn Jackman's answer is a perfect solution to the initial question. However, I forgot to stipulate that the script should handle duplicates. For instance, it should be able to look up "070-031 3" AND "070-031 6" and return BOTH corresponding numbers for the respective file (2 and 5 respectively). Currently Glenn's answer returns only the last looked-up value. I will amend my question to include this.
Correction: the first solution handles duplicates; the second doesn't. Many thanks

This works, but I have many more columns and approximately 10000 rows, so I'd rather not have to generate a lookup file for each.

Your dataset is small enough that you can keep the lookups in memory.

In a BEGIN section, read "lookup.txt" into a two-dimensional (nested) array so that, when 030-031 is the column to return (true nested arrays need gawk 4.0 or later):

lookup["070-031"][4] = 3
lookup["070-291"][5] = 3

Then run through all the data files at once:

script.awk 070-031.txt 070-291.txt
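
As a rough illustration of the idea above, here is a minimal sketch that loads lookup.txt once in BEGIN and then streams any number of data files. It is not the answerer's own code. It uses awk's classic comma subscripts (idx[name, value]), which emulate a two-dimensional array in any POSIX awk; with gawk 4.0 or later the same structure could be written as idx[name][value]. The function name lookup_all and the array names idx and cell are hypothetical, and the sketch assumes the column name and the value are the last two fields of each data line.

```shell
# lookup_all LOOKUPFILE DATAFILE...   (hypothetical helper name)
lookup_all() {
    lookup_file=$1
    shift
    awk -v lookup_file="$lookup_file" '
    BEGIN {
        OFS = "\t"
        # Header line: column names.
        if ((getline hdr < lookup_file) <= 0) {
            print "cannot read " lookup_file > "/dev/stderr"
            exit 1
        }
        ncol = split(hdr, name)
        # Body: cell[row, column name] holds the value in that row, and
        # idx[column name, value] holds the row where the value occurs
        # (if a value repeats within a column, the last row wins).
        while ((getline row < lookup_file) > 0) {
            r++
            split(row, f)
            for (i = 1; i <= ncol; i++) {
                cell[r, name[i]] = f[i]
                idx[name[i], f[i]] = r
            }
        }
        close(lookup_file)
    }

    # Each data file names its own destination column.
    FNR == 1 {
        dest = FILENAME
        sub(/^.*\//, "", dest)
        sub(/\.[^.]*$/, "", dest)
    }

    # Data lines are answered as they stream past, so they are never stored.
    NF >= 2 {
        k = $(NF - 1) SUBSEP $NF
        print $0, ((k in idx) ? cell[idx[k], dest] : "")
    }
    ' "$@"
}
```

For example, lookup_all lookup.txt 030-031.txt 070-031.txt processes both data files in one pass. Only the lookup table is held in memory (two array entries per cell), so long data files are not a problem, and because each data line is answered on its own, repeated column names need no special handling.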

1 Comment

Thanks for your speedy response. I'm struggling to see how I can read in an array with 10000+ records. Also, 030-031.txt etc. could be 100000+ records in length. Please forgive me if I have misunderstood your suggestion. Could you give an example? Many thanks
