Awk array with dynamic indices

Question

Let's say I have a tab-delimited file, lookup.txt:

070-031 070-291 030-031
1       2       X
2       3       1
3       4       2
4       5       3
5       6       4
6       7       5
7       8       6
8       9       7

And I have the following files containing the values to look up:

$ cat 030-031.txt
Line1   070-291 4
Line2   070-031 3

$ cat 070-031.txt
Line1   030-031 5
Line2   070-291 8

I would like script.awk to return

$ script.awk 030-031.txt lookup.txt
Line1   070-291 4   2
Line2   070-031 3   2

and

$ script.awk 070-031.txt lookup.txt
Line1   030-031 5   6
Line2   070-291 8   7

The only thing I can think of is to create a separate, expanded lookup file for each data file, e.g.

$ cat lookup_030-031.txt
070-031:1   X
070-031:2   1
070-031:3   2
070-031:4   3
070-031:5   4
070-031:6   5
070-031:7   6
070-031:8   7
070-291:2   X
070-291:3   1
070-291:4   2
070-291:5   3
070-291:6   4
070-291:7   5
070-291:8   6
070-291:9   7

and then

awk 'NR==FNR { a[$1]=$2;next}{print $0,a[$2":"$3]}' lookup_030-031.txt 030-031.txt

This works, but I have many more columns and approximately 10000 rows, so I'd rather not have to generate a lookup file for each. Many thanks

AMENDED

Glenn Jackman's answer is a perfect solution to the initial question and his second answer is more efficient. However, I forgot to stipulate that the script should handle duplicates. For instance, it should be able to handle

$ cat 030-031
070-031 3
070-031 6

and return BOTH corresponding numbers for the respective file (2 and 5 respectively). Only Glenn's first answer handles repeated lookups. His second returns only the last value found for each column name.

I don't understand what algorithm you're using to generate the desired output. Please elaborate.
– glenn jackman, Commented May 8, 2014 at 20:56
Certainly. In the second file that I pass to my script, field $2 contains the name of the column (index) that I wish to look up in. The output column is defined by the file name of that second file. So for Line1 of 030-031.txt (Line1 070-291 4), I wish to look up "4" in column 070-291 and return the corresponding value in column 030-031, which is "2". Many thanks
– user2606364, Commented May 8, 2014 at 21:17
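
To make the rule described above concrete, here is a minimal sketch of one lookup done on its own. The variable names from, key, and to are made up for this illustration and do not come from either answer; the table is the sample lookup.txt from the question, recreated in a scratch directory.

```sh
# Recreate the sample lookup table from the question (tab-delimited).
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\t%s\t%s\n' \
    070-031 070-291 030-031 \
    1 2 X  2 3 1  3 4 2  4 5 3 \
    5 6 4  6 7 5  7 8 6  8 9 7 > lookup.txt

# Hypothetical one-off lookup: find the row whose "from" column holds
# "key", then print that row's "to" column.
single=$(awk -F'\t' -v from=070-291 -v key=4 -v to=030-031 '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # header: label -> column number
    $(col[from]) == key { print $(col[to]) }
' lookup.txt)
echo "$single"
```

This is the "Line1 070-291 4" case from 030-031.txt: the source column and value come from the data line, and the destination column comes from the data file's name.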

glenn jackman · Accepted Answer · 2014-05-08 21:32:45Z

1

OK, I see now. You have to read the lookup file into a big data structure; after that, looking up the values from the individual files is easy.

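$ cat script.awk
BEGIN {OFS = "\t"}
# Header line of lookup.txt: remember each column's label.
NR==1 {
    for (i=1; i<=NF; i++)
        label[i] = $i
    next
}
# Remaining lookup.txt lines: map (source label, source value, target label)
# to the target value, for every pair of distinct columns.
NR==FNR {
    for (i=1; i<=NF; i++)
        for (j=1; j<=NF; j++)
            if (i != j)
                value[label[i],$i,label[j]] = $j
    next
}
# First line of each data file: the file's base name is the target column.
FNR==1 {
    split(FILENAME, a, /\./)
    j = a[1]
}
# Data lines are assumed to be "label value" pairs, as in the runs below;
# for the question's three-column layout, use $2 and $3 instead.
{
    $(NF+1) = value[$1,$2,j]
    print
}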

$ awk -f script.awk lookup.txt 030-031.txt
070-291 4   2
070-031 3   2

$ awk -f script.awk lookup.txt 070-031.txt 
030-031 5   6
070-291 8   7
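
The value[label[i],$i,label[j]] subscript in the script above is not a true multi-dimensional array: awk joins the comma-separated parts into a single string key, using the built-in SUBSEP variable as the separator. A small sketch of that behavior follows; the key values are taken from the sample data purely for illustration.

```sh
subsep_demo=$(awk 'BEGIN {
    # A comma-separated subscript is joined into ONE string key using SUBSEP.
    value["070-291", 4, "030-031"] = 2

    # The same key, built by hand, finds the same element.
    key = "070-291" SUBSEP "4" SUBSEP "030-031"
    print "by hand:", value[key]

    # Membership test that does not create the element as a side effect.
    if (("070-031", 99, "030-031") in value) print "unexpected"
    else print "missing key stays missing"

    # split() on SUBSEP recovers the parts of a stored key.
    for (k in value) { split(k, part, SUBSEP); print part[1], part[2], part[3] }
}')
printf '%s\n' "$subsep_demo"
```

Because the key is a plain string, the number 4 and the field text "4" end up as the same subscript, which is why looking up a field read from a data file works.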

This version is a bit more compact, and passes the filenames in your preferred order:

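$ cat script.awk
BEGIN {OFS = "\t"}

# First file (the data file): its base name is the destination column.
NR==1 {
    split(FILENAME, a, /\./)
    dest = a[1]
}
# Store one value per label; a repeated label keeps only its last value.
NR==FNR {
    src[$1]=$2
    next
}
# Header line of lookup.txt: map each label to its column number.
FNR==1 {
    for (i=1; i<=NF; i++)
        col[$i]=i
    next
}

# Each lookup row: print a result for every stored label whose value matches.
{
    for (from in src)
        if ($col[from] == src[from])
            print from, src[from], $col[dest]
}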

$ awk -f script.awk 030-031.txt lookup.txt
070-031 3   2
070-291 4   2

$ awk -f script.awk 070-031.txt lookup.txt
030-031 5   6
070-291 8   7
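
The compact version stores a single value per label in src, so a label that appears twice in the data file keeps only its last value. One possible way around that, sketched here and not part of the original answer, is to remember every data line in order and key the results by (label, value). The file name dups.awk and the array names need and found are hypothetical, and the data file is assumed to hold two-column "label value" lines as in the runs above.

```sh
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\t%s\t%s\n' \
    070-031 070-291 030-031 \
    1 2 X  2 3 1  3 4 2  4 5 3 \
    5 6 4  6 7 5  7 8 6  8 9 7 > lookup.txt
# Data file with a repeated label (070-031 appears twice).
printf '%s\t%s\n' 070-031 3  070-031 6  070-291 4 > 030-031.txt

cat > dups.awk <<'EOF'
BEGIN { OFS = "\t" }
# First file (the data file): its base name is the destination column.
NR == 1 { split(FILENAME, a, /\./); dest = a[1] }
# Remember every data line in input order, plus the set of labels in use.
NR == FNR { n++; from[n] = $1; key[n] = $2; need[$1]; next }
# Header line of lookup.txt: map each label to its column number.
FNR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
# Each lookup row: record the destination value under (label, value).
{ for (f in need) if (f in col) found[f, $(col[f])] = $(col[dest]) }
# Replay the data lines in their original order with the result appended.
END { for (i = 1; i <= n; i++) print from[i], key[i], found[from[i], key[i]] }
EOF

dups=$(awk -f dups.awk 030-031.txt lookup.txt)
printf '%s\n' "$dups"
```

The trade-off is memory: the sketch holds the whole data file plus one entry per lookup row for each label in use, and prints nothing until the lookup file has been read to the end.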

edited May 8, 2014 at 21:32

answered May 8, 2014 at 21:13

glenn jackman


5 Comments

glenn jackman Over a year ago

And if you wanted to get efficient, add alias lookup='awk -f script.awk lookup.txt' so you can use lookup 070-031.txt

user2606364 Over a year ago

Awksome, thank you. I seem to have reached an impasse with my understanding of two-dimensional array structures. Back to the books, I think!

glenn jackman Over a year ago

The efficient shorthand for the 2nd version is lookup() { awk -f script.awk "$1" lookup.txt; }

user2606364 Over a year ago

Glenn Jackman's answer is a perfect solution to the initial question. However, I forgot to stipulate that the script should handle duplicates. For instance, it should be able to look up "070-031 3" AND "070-031 6" and return BOTH corresponding numbers for the respective file (2 and 5 respectively). Currently Glenn's answer returns only the last looked-up value. I will amend my question to include this.

user2606364 Over a year ago

Correction: the first solution handles duplicates; the second doesn't. Many thanks

Raymond Hettinger · Answer · 2014-05-08 20:46:45Z

0

"This works but I have many more columns and approximately 10000 rows, so i'd rather not have to generate a lookup file for each."

Your dataset is small enough that you can keep the lookups in memory.

In a BEGIN section, read "lookup.txt" into a two-dimensional (nested) array so that:

lookup["070-031"][4] = 3
lookup["070-291"][5] = 3

Then run through all the data files at once:

script.awk 070-031.txt 070-291.txt
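
A minimal sketch of that idea, under a few assumptions: true nested subscripts like lookup[...][...] need GNU awk 4 or later, so this uses the portable comma-subscript form instead; the file name all.awk and the array names cell and rowof are made up for the example; and the data files are assumed to hold two-column "label value" lines.

```sh
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\t%s\t%s\n' \
    070-031 070-291 030-031 \
    1 2 X  2 3 1  3 4 2  4 5 3 \
    5 6 4  6 7 5  7 8 6  8 9 7 > lookup.txt
printf '%s\t%s\n' 070-291 4  070-031 3 > 030-031.txt
printf '%s\t%s\n' 030-031 5  070-291 8 > 070-031.txt

cat > all.awk <<'EOF'
BEGIN {
    OFS = "\t"
    # Load the whole table once, before any data file is read.
    while ((getline line < "lookup.txt") > 0) {
        nf = split(line, f, "\t")
        if (++row == 1) {                  # header: column number -> label
            for (i = 1; i <= nf; i++) name[i] = f[i]
            continue
        }
        for (i = 1; i <= nf; i++) {
            cell[row, name[i]] = f[i]      # value at (row, column label)
            rowof[name[i], f[i]] = row     # row holding this value in this column
        }
    }
    close("lookup.txt")
}
# Each data file names its own destination column.
FNR == 1 { split(FILENAME, a, /\./); dest = a[1] }
{ print FILENAME, $1, $2, cell[rowof[$1, $2], dest] }
EOF

all=$(awk -f all.awk 030-031.txt 070-031.txt)
printf '%s\n' "$all"
```

Only the lookup table is held in memory here; the data files are streamed line by line, so their length should matter much less than the size of lookup.txt.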

answered May 8, 2014 at 20:46

Raymond Hettinger


1 Comment

user2606364 Over a year ago

Thanks for your speedy response. I'm struggling to see how I can read in an array with 10000+ records. Also, 030-031.txt etc. could be 100000+ records in length. Please forgive me if I have misunderstood your suggestion. Could you give an example? Many thanks
