1

I had a problem solved in a previous post using awk, but now I want to add an if condition to it, and I am getting an error.

Here's the problem:

I had a lot of files that looked like this:

 Header
 175566717.000
 175570730.000
 175590376.000
 175591966.000
 175608932.000
 175612924.000
 175614836.000
 .
 .
 .
 175680016.000
 175689679.000
 175695803.000
 175696330.000

And I wanted to extract the first 2000 lines (lines 1 to 2000), then lines 1500 to 3500, then 3000 to 5000, and so on... What I mean is: extract a window of 2000 lines with an overlap of 500 lines between contiguous windows, until the end of the file.

This is the awk command used for it:

awk -v i=1 -v t=2000 -v d=501 'NR>1{a[NR-1]=$0}END{
    while(i<NR-1){
        ++n;
        for(k=i;k<i+t;k++)print a[k] > "win"n".txt"; 
        close("_win"n".txt") 
        i=i+t-d
    }

}' myfile.txt

And I get several files with names win1.txt, win2.txt, win3.txt, etc.

My problem now is that, because the number of lines in the file is not a multiple of 2000, my last window has fewer than 2000 lines. How can I add an if condition that does this: if the last window would have fewer than 2000 numbers, the previous window should instead contain all the lines until the end of the file?

EXTRA INFO

When the windows are created, there is a line break at the end of each file. That is why the if condition needs to take into account a window of fewer than 2000 numbers, and not just count lines.
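
For example (just to illustrate the difference, this is not part of my script), counting the numbers in a window rather than its lines could be done with something like:

grep -c '[0-9]' win1.txt    # counts only the lines that contain a number
wc -l win1.txt              # would also count the trailing blank line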

3 Comments
  • A quick and dirty way would be to do it afterwards (outside of awk, but in the bash script). Get the filenames of the last 2 files, run wc on the last file and apply your test; if it's less than 2000, cat it to the second-to-last file. Commented Feb 14, 2014 at 17:51
  • Just out of curiosity, why do you want an overlap between files? Good luck. Commented Feb 14, 2014 at 18:02
  • @TimothyBrown thanks, that might work. I will try that. Commented Feb 14, 2014 at 19:27
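
A minimal bash sketch of that quick-and-dirty post-processing (assuming the window files are named win1.txt ... winN.txt and numbered consecutively, with a window size of 2000 and an overlap of 500 lines; the tail call skips the overlapping lines, which a plain cat would duplicate):

#!/bin/bash
# Merge the last window into the previous one if it came out short.
t=2000                                # window size
d=500                                 # overlap between consecutive windows
files=(win*.txt)
n=${#files[@]}                        # windows are numbered consecutively, so count = last index
last="win${n}.txt"
prev="win$((n - 1)).txt"
count=$(grep -c '[0-9]' "$last")      # count the numbers, not the trailing blank line
if (( n > 1 && count < t )); then
    # the first d lines of the short window already sit at the end of the previous window
    tail -n +"$((d + 1))" "$last" >> "$prev"
    rm "$last"
fi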

2 Answers

1

If you don't have to use awk for some other reason, try this sed approach:

#!/bin/bash
# Strip blank lines once, then slice the overlapping windows out of the result.
file="$(sed '/^\s*$/d' myfile.txt)"
sed -n 1,2000p <<< "$file"                  # first window: lines 1 to 2000
first=1500
last=3500
max=$(wc -l <<< "$file" | awk '{print $1}') # total number of non-blank lines
while [[ $max -ge 2000 && $last -lt $((max+1500)) ]]; do
  sed -n "$first","$last"p <<< "$file"      # next window, overlapping the previous one
  ((first+=1500))
  ((last+=1500))
done

Obviously this is going to be slower than awk and more error prone for gigantic files, but it should work in most cases.


4 Comments

But this would spawn sed 1+N times as opposed to spawning awk once.
Where N=#_of_lines/2000, sure. Didn't say sed would be faster than awk.
Why read the file into memory? Instead of using <<< "$file" to read from a string, just read from the file each time. The OS will cache the file in memory as well as your program will, so there should be no noticeable difference in performance.
@WilliamPursell Not a performance decision, couldn't figure out a simple way to both ignore empty lines and consider some line range at the same time.
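
One way to do both at once, that is, drop the blank lines and still address a line range, without reading the file into a variable (a sketch, not from the thread; it reuses the script's $first and $last), is to filter in a pipe:

sed '/^\s*$/d' myfile.txt | sed -n "$first","$last"p
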
1

Change the while condition to make it stop earlier:

while (i+t <= NR) {

Change the end condition of the for loop to compensate for the last output file being potentially bigger:

for (k = i; k < (i+t+t-d <= NR ? i+t : NR); k++)

The rest of your code can stay the same, although I took the liberty of removing the close statement (why was that there?) and of setting d=500, to make the output files really overlap by 500 lines.

awk -v i=1 -v t=2000 -v d=500 'NR>1{a[NR-1]=$0}END{
    while (i+t <= NR) {
        ++n;
        for (k=i; k < (i+t+t-d <= NR ? i+t : NR); k++) print a[k] > "win"n".txt"; 
        i=i+t-d
    }
}' myfile.txt

I tested it with small values of t and d, and it seems to work as requested.
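
For instance, a quick check of that kind (a sketch; the throwaway file test.txt and the values t=5 and d=2 are only for illustration) could look like:

rm -f win*.txt                            # start clean so old windows don't skew the check
{ echo Header; seq 13; } > test.txt       # throwaway input: a header plus 13 numbers
awk -v i=1 -v t=5 -v d=2 'NR>1{a[NR-1]=$0}END{
    while (i+t <= NR) {
        ++n;
        for (k=i; k < (i+t+t-d <= NR ? i+t : NR); k++) print a[k] > "win"n".txt"; 
        i=i+t-d
    }
}' test.txt
wc -l win*.txt                            # expect 5, 5 and 7 lines: the last window absorbs the remainder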

One final remark: for big input files, I wouldn't encourage storing the whole thing in array a.
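
For what it's worth, the windows could also be written in one streaming pass, appending each data line to every window it belongs to instead of buffering the whole file. This is only a sketch of that idea, not the answer's code, and it still leaves a short final window that would have to be merged afterwards (e.g. as suggested in the comments under the question):

awk -v t=2000 -v d=500 'NR > 1 {
    p = NR - 1                           # data line index (header skipped)
    s = t - d                            # distance between consecutive window starts
    lo = int((p - t + s - 1) / s) + 1    # first window that contains line p
    if (lo < 1) lo = 1
    hi = int((p - 1) / s) + 1            # last window that contains line p
    for (n = lo; n <= hi; n++) {
        f = "win" n ".txt"
        print > f
        if (p == (n - 1) * s + t) close(f)   # window n is complete, release its file handle
    }
}' myfile.txt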

