I'm working on an Awk/Gawk script that parses a file, populating a multidimensional array from each line. The first column is a period-delimited string in which each segment is the array key for the next level down. The second column is the value.

Here's an example of what the content being parsed looks like:

$ echo -e "personal.name.first\t= John\npersonal.name.last\t= Doe\npersonal.other.dob\t= 05/07/87\npersonal.contact.phone\t= 602123456\npersonal.contact.email\t= john.doe@idk\nemployment.jobs.1\t= Company One\nemployment.jobs.2\t= Company Two\nemployment.jobs.3\t= Company Three"
personal.name.first     = John
personal.name.last      = Doe
personal.other.dob      = 05/07/87
personal.contact.phone  = 602123456
personal.contact.email  = john.doe@idk
employment.jobs.1       = Company One
employment.jobs.2       = Company Two
employment.jobs.3       = Company Three

After parsing, I'm expecting the data to have the same structure as:

data["personal"]["name"]["first"]     = "John"
data["personal"]["name"]["last"]      = "Doe"
data["personal"]["other"]["dob"]      = "05/07/87"
data["personal"]["contact"]["phone"]  = "602123456"
data["personal"]["contact"]["email"]  = "john.doe@idk"
data["employment"]["jobs"]["1"]       = "Company One"
data["employment"]["jobs"]["2"]       = "Company Two"
data["employment"]["jobs"]["3"]       = "Company Three"

The part that I'm stuck on is how to dynamically populate the keys while structuring the multidimensional array.

I found this SO thread that covers a similar issue, which was resolved by using the SUBSEP variable. At first that seemed like it would work as I needed, but after some testing it looks like arr["foo", "bar"] = "baz" doesn't get treated like a real nested array the way arr["foo"]["bar"] = "baz" would. For example, I can't count the values at any level of the array: arr["foo", "bar"] = "baz"; print length(arr["foo"]) simply prints 0 (zero).
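
To illustrate the behaviour described above, here is a minimal sketch that should run in any POSIX awk (the `out` shell variable and the message strings are only for illustration). It shows that arr["foo", "bar"] creates a single flat element whose key is the string "foo" SUBSEP "bar", so there is no arr["foo"] sub-array to measure:

```shell
out=$(awk 'BEGIN {
    arr["foo", "bar"] = "baz"      # one flat key: "foo" SUBSEP "bar"

    # The only way to recover the parts is to split the key on SUBSEP.
    for (key in arr) {
        n = split(key, parts, SUBSEP)
        printf "%d parts: %s / %s\n", n, parts[1], parts[2]
    }

    # There is no separate "foo" element, let alone a sub-array.
    if ("foo" in arr) print "foo is a key"; else print "foo is not a key"
}')
printf '%s\n' "$out"
```

This prints "2 parts: foo / bar" followed by "foo is not a key".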

I also found this SO thread, which helps a little and possibly points me in the right direction.

That thread includes this snippet:

BEGIN {
  x=SUBSEP

  a="Red" x "Green" x "Blue"
  b="Yellow" x "Cyan" x "Purple"

  Colors[1][0] = ""
  Colors[2][0] = ""

  split(a, Colors[1], x)
  split(b, Colors[2], x)

  print Colors[2][3]
}

That is pretty close, but the problem I'm having now is that the keys (e.g. Red, Green, etc.) need to be specified dynamically, and there could be one or more keys.

Basically, how can I take the a_keys and b_keys strings, split them on ".", and populate the a and b variables as multidimensional arrays?

BEGIN {
  x=SUBSEP

  # How can I take these strings...
  a_keys = "Red.Green.Blue"
  b_keys = "Yellow.Cyan.Purple"

  # .. And populate the array, just as this does:
  a="Red" x "Green" x "Blue"
  b="Yellow" x "Cyan" x "Purple"

  Colors[1][0] = ""
  Colors[2][0] = ""

  split(a, Colors[1], x)
  split(b, Colors[2], x)

  print Colors[2][3]
}

Any help would be appreciated, thanks!

2 Answers

All you need is:

BEGIN { FS="\t= " }
{
    split($1,d,/\./)
    data[d[1]][d[2]][d[3]] = $2
}

Look:

$ cat tst.awk
BEGIN { FS="\t= " }
{
    split($1,d,/\./)
    data[d[1]][d[2]][d[3]] = $2
}
END {
    for (x in data)
        for (y in data[x])
            for (z in data[x][y])
                print x, y, z, "->", data[x][y][z]
}

$ awk -f tst.awk file
personal other dob -> 05/07/87
personal name first -> John
personal name last -> Doe
personal contact email -> john.doe@idk
personal contact phone -> 602123456
employment jobs 1 -> Company One
employment jobs 2 -> Company Two
employment jobs 3 -> Company Three

The above is gawk-specific, of course, since no other awk supports true multi-dimensional arrays (arrays of arrays, which gawk has had since version 4.0).

To populate a multi-dimensional array when the indices aren't always of the same depth (e.g. 3 above), it's rather more complicated:

##########
$ cat tst.awk
function rec_populate(a,idxs,curDepth,maxDepth,tmpIdxSet) {
    if ( tmpIdxSet ) {
        delete a[SUBSEP]                # delete scalar a[]
        tmpIdxSet = 0
    }
    if (curDepth < maxDepth) {
        # a[idxs[curDepth]] must already be an array before the recursive
        # call, otherwise gawk would treat it as a scalar inside
        # rec_populate(). Creating the placeholder element [SUBSEP] below
        # forces it to be an array; the placeholder is deleted at the top of
        # the next call. SUBSEP is used as the placeholder index arbitrarily.

        if ( !( (idxs[curDepth] in a) && (SUBSEP in a[idxs[curDepth]]) ) ) {
            a[idxs[curDepth]][SUBSEP]   # create array a[] + scalar a[][]
            tmpIdxSet = 1
        }
        rec_populate(a[idxs[curDepth]],idxs,curDepth+1,maxDepth,tmpIdxSet)
    }
    else {
        a[idxs[curDepth]] = $2
    }
}

function populate(arr,str,sep,  idxs) {
    split(str,idxs,sep)
    rec_populate(arr,idxs,1,length(idxs),0)
}

{ populate(arr,$1,",") }

END { walk_array(arr, "arr") }

function walk_array(arr, name,      i)
{
    # Mostly copied from the following URL, just added setting of "sorted_in":
    #   https://www.gnu.org/software/gawk/manual/html_node/Walking-Arrays.html
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (i in arr) {
        if (isarray(arr[i]))
            walk_array(arr[i], (name "[" i "]"))
        else
            printf("%s[%s] = %s\n", name, i, arr[i])
    }
}

##########
$ cat file
a uno
b,c dos
d,e,f tres_wan
d,e,g tres_twa
d,e,h,i,j cinco

##########
$ awk -f tst.awk file
arr[a] = uno
arr[b][c] = dos
arr[d][e][f] = tres_wan
arr[d][e][g] = tres_twa
arr[d][e][h][i][j] = cinco

2 Comments

I may have forgotten to mention that the keys may not always be exactly three segments. But this is a good enough start for me to work with. Thanks!
Then you'll need the second solution I posted. Stating the obvious - if your real data has multiple ranks of segments then your sample input should have too.
0

Without real multidimensional arrays, you can do a little more bookkeeping:

awk -F'\t= ' '{split($1,k,"."); 
               k1[k[1]]; k2[k[2]]; k3[k[3]]; 
               v[k[1],k[2],k[3]]=$2}
          END {for(i1 in k1) 
                 for(i2 in k2)
                   for(i3 in k3) 
                     if((i1,i2,i3) in v) 
                       print i1,i2,i3," -> ",v[i1,i2,i3]}' file


personal other dob  ->  05/07/87
personal name first  ->  John
personal name last  ->  Doe
personal contact email  ->  john.doe@idk
personal contact phone  ->  602123456
employment jobs 1  ->  Company One
employment jobs 2  ->  Company Two
employment jobs 3  ->  Company Three
