Wanted: nonassociative array in AWK

Question

I am looking at speeding up some code in AWK by serially processing an internal string representation instead of serially processing an array representation. (For Examples 1 and 2 below assume datasep is a single character that is not part of the computable data.) So instead of:

# Example 1
split(datastr,data,datasep)
for(i=1; i in data; i++) { 
     # Use data[i]
     }

I want to try something like

 # Example 2 - buggy code assumes datastr terminated by datasep
 l= length(datastr)
 for(j=1; j < l ; ) {
      datumlen = match(substr(datastr,j+1),datasep)
      #Use substr(datastr,j+1,datumlen-1)
      j+=datumlen
      }

This is because I want to save memory and lookup time that is involved in using an associative array (data), and also because I have faith in how match and substr are implemented. I plan to start with datastr having length > 10^6 bytes (with datumlen < 5 most of the time), and push it up from there. I can stream out the results, so I am not worried about memory requirements of the code, but I may need to make more than one pass over datastr, so I would like to avoid streaming datastr (unless that is even faster).

So the question is: Are there memory- and access-efficient routines that improve on Example 1 and look something like Example 2? Or would I be better off trusting the internal buffering that AWK and the system use to process input files, and just making several passes over the same input file?

EDIT 2015.09.18: (I am not registered in this forum yet, so answering a comment here.) I am using gawk 4.1.3 on a non-Unix platform. I am interested in having a small portable environment in which to do certain types of computing. I do not know enough about gawk internals, and thought perhaps someone reading this forum had tried something like this before. I will end up profiling the different ways if I receive no other suggestions. END EDIT 2015.09.18

Gerhard "Ask Me About System Tweaking" Paseman, 2015.09.16

If you're talking about performance, you need to specify which version of awk you're using. Have you done benchmarks to see if this was really a showstopper? If it is, why aren't you considering other languages that offer a richer set of data structures?
– Gilles 'SO- stop being evil', Commented Sep 16, 2015 at 23:52

Gerhard Paseman · Accepted Answer · 2015-09-30 20:35:03Z

Here is some test code for gawk 4.1.3 I wrote to answer the question. The original data in PFILE was numeric, and I was trying to compress things by storing differences between consecutive entries in DFILE.

BEGIN{ RLS=bufstr=""; SEP =":" ; PFILE="somenumbers.txt" ; DFILE= "diffile.txt"
if (ATEST=="") ATEST=1
accumulate=lastdatum=0 ; BIGN=5500000 ; DATALENMAX=7 ;TUNELEN=2048
for(i=1; i < BIGN ; i++) {
     getline nextdatum < PFILE
     d = nextdatum -lastdatum
#     RLS = RLS d SEP
     ibuf( d SEP )
     print d > DFILE
     lastdatum=nextdatum  }
# RLS = RLS "0"
ibuf("0")
if (length(bufstr) > 0) { RLS = RLS bufstr ; bufstr="" }
print (RLSlen=length(RLS))
close(PFILE) ; close(DFILE)
timestmp["start"] = systime()
if (ATEST==1){
  split(RLS,data,SEP)
  timestmp["endsplit"] = systime()
  for(i=1; i in data; i++){     accumulate += 1*data[i]     }
  }
if (ATEST==2){
  for(j=1; j<RLSlen ; j+=datalen) {
     datalen=match(substr(RLS,j, DATALENMAX),SEP)
     accumulate  += 1*substr(RLS,j,datalen-1)     }
  }
if (ATEST==3) {
  while((getline diff < DFILE)>0){  accumulate  += 1*diff }
  close(DFILE)
  }
print accumulate 
timestmp["end"] = systime()
for(t in timestmp) print t, (1*timestmp[t] - 1*timestmp["start"])
}

function ibuf(str) {   bufstr=bufstr str
   if (length(bufstr) > TUNELEN) { RLS = RLS bufstr ; bufstr="" }
}

The ibuf() function and TUNELEN parameter aren't crucial; I just got tired of seeing the allocated memory value thrash back and forth because of the assignment

RLS = RLS d SEP

so I decided to buffer that part.

I expected the second and third sections (ATEST=2 and 3) to perform a little faster than the first section. That did not happen. Working with the arrays always seemed a little faster, in the extreme about twice as fast as section 2, and a little faster than section 3. However, the array version used about 10 times (or more) memory because of having to store indices as well as values.

I initially tested section 2 without a DATALENMAX value, and that made things very slow because every substr() call then returned the whole remainder of the string. The section 2 method definitely does not give more speed, although it does save on memory used by the input data.
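The windowed scan from section 2 can be isolated as the sketch below. It is not the code that was benchmarked: the shell wrapper, the function name sum_window, and the variables s, sep and win are made up for illustration, and the string is passed with -v only to keep the example short. The sketch also adds a fallback that the test code does not have: if a datum is longer than the window, match() returns 0 and the loop in section 2 would stop advancing.

```sh
# sum_window STRING: add up the ":"-separated numbers in STRING by creeping
# through it, looking at no more than `win` characters per step.
# (Hypothetical helper; names and the window size are assumptions.)
sum_window() {
  awk -v s="$1" -v sep=":" -v win=7 'BEGIN {
    n = length(s); total = 0
    j = 1
    while (j <= n) {
      # Cheap case: search for the separator inside a short window only.
      len = match(substr(s, j, win), sep)
      # Rare case: datum longer than the window, search the whole remainder.
      if (len == 0) len = match(substr(s, j), sep)
      if (len == 0) {
        # No separator left at all: the remainder is the last datum.
        total += substr(s, j) + 0
        break
      }
      total += substr(s, j, len - 1) + 0
      j += len            # step past the datum and its separator
    }
    print total
  }'
}
```

For example, sum_window '3:4:5:0' prints 12. The fallback search over the whole remainder costs as much as the version without a window, so it only makes sense if long data are rare.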

In all, if you have memory to burn, use associative arrays. If you have a good disk, read from a file. If you have to conserve memory, creep through the string, but be careful to look only at small pieces. On my system, I may run into memory limits, so I will probably read from a file for my application. If anyone sees a way to tweak section 2, say by using index or some other memory-conserving means to access the string, I would like to know of it.
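One possible tweak, offered as a sketch that has not been timed against the three sections above: split the string one chunk at a time, so the array never holds more than one chunk's worth of elements. The names sum_chunked, piece and part, the shell wrapper, and the default chunk size of 4096 are assumptions for illustration. Only POSIX awk string functions are used.

```sh
# sum_chunked STRING [CHUNK]: add up the ":"-separated numbers in STRING,
# calling split() on at most CHUNK characters at a time.
# (Hypothetical helper; names and the default chunk size are assumptions.)
sum_chunked() {
  awk -v s="$1" -v sep=":" -v chunk="${2:-4096}" 'BEGIN {
    n = length(s); total = 0
    for (j = 1; j <= n; j += k) {
      piece = substr(s, j, chunk)
      k = length(piece)
      if (j + k - 1 < n) {
        # Not the final piece: back up to the last separator in it,
        # so that no datum is cut in half.
        while (k > 0 && substr(piece, k, 1) != sep) k--
        if (k == 0) {
          print "datum longer than chunk" > "/dev/stderr"
          exit 1
        }
        piece = substr(piece, 1, k - 1)   # drop the trailing separator
      }
      # The array part is overwritten on every pass, so it stays small.
      cnt = split(piece, part, sep)
      for (i = 1; i <= cnt; i++) total += part[i] + 0
    }
    print total
  }'
}
```

For example, sum_chunked '3:4:5:0' 4 prints 12, processing the string as the pieces "3:4" and "5:0". Whether this lands closer to the speed of section 1 or the memory use of section 2 would have to be measured for a given chunk size.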

Gerhard "My Mileage Varies Pretty Often" Paseman, 2015.09.30

Although it is fine to answer your own question, it is not OK to use chit-chat on this Q&A site, as you would have known had you read the help→tour. Having your name under a post twice is really not necessary and degrades an otherwise possibly good post. Additionally, an answer is not the place to ask additional questions.
– Anthon, Commented Sep 30, 2015 at 20:46
