1

I had a problem solved in a previous post using awk, but now I want to add an if condition to it, and I am getting an error.

Here's the problem:

I had a lot of files that looked like this:

 Header
 175566717.000
 175570730.000
 175590376.000
 175591966.000
 175608932.000
 175612924.000
 175614836.000
 .
 .
 .
 175680016.000
 175689679.000
 175695803.000
 175696330.000

And I wanted to extract the first 2000 lines (lines 1 to 2000), then lines 1500 to 3500, then 3000 to 5000, and so on... What I mean is: extract a window of 2000 lines with an overlap of 500 lines between contiguous windows, until the end of the file.

This is the awk command used for it:

awk -v i=1 -v t=2000 -v d=501 'NR>1{a[NR-1]=$0}END{
    while(i<NR-1){
        ++n;
        for(k=i;k<i+t;k++)print a[k] > "win"n".txt"; 
        close("_win"n".txt") 
        i=i+t-d
    }

}' myfile.txt

And I get several files with names win1.txt, win2.txt, win3.txt, etc.

My problem now is that, because the number of lines in the file is not a multiple of 2000, my last window has fewer than 2000 lines. How can I add an if condition that does this: if the last window would have fewer than 2000 numbers, the previous window should instead contain all the lines until the end of the file?

EXTRA INFO

When the windows are created, there is a line break at the end of each file. That is why the if condition needs to take into account a window of fewer than 2000 numbers, and not just count lines.
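
For example (just to illustrate the difference, this is not part of my script), counting the numbers in a window rather than its lines could be done with something like:

grep -c '[0-9]' win1.txt    # counts only the lines that contain a number
wc -l win1.txt              # would also count the trailing blank line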

3 Comments
  • A quick and dirty way would be to do it afterwards (outside of awk, but in the bash script). Get the filenames of the last 2 files, run wc on the last file and apply your test; if it's less than 2000, cat it to the second-to-last file. Commented Feb 14, 2014 at 17:51
  • Just out of curiosity, why do you want an overlap between files? Good luck. Commented Feb 14, 2014 at 18:02
  • @TimothyBrown thanks, that might work. I will try that. Commented Feb 14, 2014 at 19:27
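
A minimal bash sketch of that quick-and-dirty post-processing (assuming the window files are named win1.txt ... winN.txt and numbered consecutively, with a window size of 2000 and an overlap of 500 lines; the tail call skips the overlapping lines, which a plain cat would duplicate):

#!/bin/bash
# Merge the last window into the previous one if it came out short.
t=2000                                # window size
d=500                                 # overlap between consecutive windows
files=(win*.txt)
n=${#files[@]}                        # windows are numbered consecutively, so count = last index
last="win${n}.txt"
prev="win$((n - 1)).txt"
count=$(grep -c '[0-9]' "$last")      # count the numbers, not the trailing blank line
if (( n > 1 && count < t )); then
    # the first d lines of the short window already sit at the end of the previous window
    tail -n +"$((d + 1))" "$last" >> "$prev"
    rm "$last"
fi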

2 Answers

1

If you don't have to use awk for some other reason, try this sed approach:

#!/bin/bash
# Strip blank lines once, then slice the overlapping windows out of the result.
file="$(sed '/^\s*$/d' myfile.txt)"
sed -n 1,2000p <<< "$file"                  # first window: lines 1 to 2000
first=1500
last=3500
max=$(wc -l <<< "$file" | awk '{print $1}') # total number of non-blank lines
while [[ $max -ge 2000 && $last -lt $((max+1500)) ]]; do
  sed -n "$first","$last"p <<< "$file"      # next window, overlapping the previous one
  ((first+=1500))
  ((last+=1500))
done

Obviously this is going to be slower than awk and more error prone for gigantic files, but it should work in most cases.


4 Comments

But this would spawn sed 1+N times as opposed to spawning awk once.
Where N=#_of_lines/2000, sure. Didn't say sed would be faster than awk.
Why read the file into memory? Instead of using <<< "$file" to read from a string, just read from the file each time. The OS will cache the file in memory as well as your program will, so there should be no noticeable difference in performance.
@WilliamPursell Not a performance decision, couldn't figure out a simple way to both ignore empty lines and consider some line range at the same time.
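
One way to do both at once, that is, drop the blank lines and still address a line range, without reading the file into a variable (a sketch, not from the thread; it reuses the script's $first and $last), is to filter in a pipe:

sed '/^\s*$/d' myfile.txt | sed -n "$first","$last"p
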
1

Change the while condition to make it stop earlier:

while (i+t <= NR) {

Change the end condition of the for loop to compensate for the last output file being potentially bigger:

for (k = i; k < (i+t+t-d <= NR ? i+t : NR); k++)

The rest of your code can stay the same, although I took the liberty of removing the close statement (why was that there?) and of setting d=500, to make the output files really overlap by 500 lines.

awk -v i=1 -v t=2000 -v d=500 'NR>1{a[NR-1]=$0}END{
    while (i+t <= NR) {
        ++n;
        for (k=i; k < (i+t+t-d <= NR ? i+t : NR); k++) print a[k] > "win"n".txt"; 
        i=i+t-d
    }
}' myfile.txt

I tested it with small values of t and d, and it seems to work as requested.
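
For instance, a quick check of that kind (a sketch; the throwaway file test.txt and the values t=5 and d=2 are only for illustration) could look like:

rm -f win*.txt                            # start clean so old windows don't skew the check
{ echo Header; seq 13; } > test.txt       # throwaway input: a header plus 13 numbers
awk -v i=1 -v t=5 -v d=2 'NR>1{a[NR-1]=$0}END{
    while (i+t <= NR) {
        ++n;
        for (k=i; k < (i+t+t-d <= NR ? i+t : NR); k++) print a[k] > "win"n".txt"; 
        i=i+t-d
    }
}' test.txt
wc -l win*.txt                            # expect 5, 5 and 7 lines: the last window absorbs the remainder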

One final remark: for big input files, I wouldn't encourage storing the whole thing in array a.
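
For what it's worth, the windows could also be written in one streaming pass, appending each data line to every window it belongs to instead of buffering the whole file. This is only a sketch of that idea, not the answer's code, and it still leaves a short final window that would have to be merged afterwards (e.g. as suggested in the comments under the question):

awk -v t=2000 -v d=500 'NR > 1 {
    p = NR - 1                           # data line index (header skipped)
    s = t - d                            # distance between consecutive window starts
    lo = int((p - t + s - 1) / s) + 1    # first window that contains line p
    if (lo < 1) lo = 1
    hi = int((p - 1) / s) + 1            # last window that contains line p
    for (n = lo; n <= hi; n++) {
        f = "win" n ".txt"
        print > f
        if (p == (n - 1) * s + t) close(f)   # window n is complete, release its file handle
    }
}' myfile.txt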

