I am looking at speeding up some AWK code by walking serially through a single string that holds the data, instead of walking through an array built from it. (For Examples 1 and 2 below, assume datasep is a single character that is not part of the computable data.) So instead of:
# Example 1
split(datastr, data, datasep)
for (i = 1; i in data; i++) {
    # Use data[i]
}
I want to try something like:
# Example 2 - rough sketch; assumes datastr is terminated by datasep (otherwise it never exits)
l = length(datastr)
for (j = 0; j < l; ) {    # j = number of characters consumed so far
    datumlen = match(substr(datastr, j+1), datasep)
    # Use substr(datastr, j+1, datumlen-1)
    j += datumlen
}
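A minimal sketch of how Example 2 might look without the termination assumption is given here. The function name scan() and the sample strings are hypothetical, chosen only for illustration. It swaps match() for index(), because index() searches for datasep literally while match() would interpret it as a regular expression; that swap is an assumption about what is wanted, not a measured improvement.

```awk
# Sketch: walk datastr one datum at a time without building an array.
# scan() and the sample values below are hypothetical, for illustration only.
function scan(datastr, datasep,    l, j, datumlen, n) {
    l = length(datastr)
    n = 0
    for (j = 0; j < l; ) {              # j = number of characters consumed so far
        # index() does a literal search; match() would read datasep as a regex
        datumlen = index(substr(datastr, j + 1), datasep)
        if (datumlen == 0)              # no datasep left: the rest is the last datum
            datumlen = l - j + 1
        printf "%d:%s\n", ++n, substr(datastr, j + 1, datumlen - 1)
        j += datumlen
    }
    return n                            # number of data visited
}

BEGIN {
    scan("12,7,,305", ",")              # not terminated, contains one empty datum
    scan("12,7,", ",")                  # terminated by datasep
}
```

Unlike split(), this loop reports no empty trailing datum when datastr ends with datasep. It is also possible that substr(datastr, j + 1) copies the remaining tail on every call, which would make the loop quadratic in length(datastr); that has not been measured here.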
This is because I want to save the memory and lookup time involved in using an associative array (data), and also because I have faith in how match and substr are implemented. I plan to start with datastr having length > 10^6 bytes (with datumlen < 5 most of the time), and push it up from there. I can stream out the results, so I am not worried about the memory needed for output, but I may need to make more than one pass over datastr, so I would like to avoid streaming datastr (unless that is even faster).
So the question is: Are there memory- and access-efficient routines that improve on Example 1 and look something like Example 2? Or would I be better off trusting the internal buffering that AWK and the system use to process input files, and just making several passes over the same input file?
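For comparison, the multi-pass alternative could be sketched roughly as follows, assuming the data sit in a file with datasep between items and that the file is simply named twice on the command line. The script name twopass.awk, the file name data.txt, and the comma separator are hypothetical. Setting RS to datasep makes each datum a record, so no array or long string is kept in memory.

```awk
# twopass.awk (hypothetical name): read the same file twice, one datum per record.
# Assumed invocation: gawk -v RS=, -f twopass.awk data.txt data.txt
FNR == 1  { pass++ }        # FNR resets to 1 at the start of each file argument
pass == 1 { count++ }       # first pass: collect whatever summary is needed
pass == 2 { printf "%d/%d: %s\n", FNR, count, $0 }    # second pass: use it
```

If the file ends with a newline rather than with datasep, that newline becomes part of the last record and may need to be stripped.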
EDIT 2015.09.18: (I am not registered in this forum yet, so answering a comment here.) I am using gawk 4.1.3 on a non-Unix platform. I am interested in having a small portable environment in which to do certain types of computing. I do not know enough about gawk internals, and thought perhaps someone reading this forum had tried something like this before. I will end up profiling the different ways if I receive no other suggestions. END EDIT 2015.09.18
Gerhard "Ask Me About System Tweaking" Paseman, 2015.09.16