I wrote a simple script to extract text from a set of files (*.out), adding two lines at the beginning and one line at the end. I then concatenate the extracted text with another file to create a new file. Here is the script:

#!/usr/bin/env bash
#A simple bash script to extract text from *.out and create another file 
for f in *.out; do
#In the following line, n is a number which is extracted from the file name
        n=$(echo $f | cut -d_ -f6)
        t=$((2 * $n ))
#To extract the necessary text/data
        grep "  B  " $f | tail -${t} | awk 'BEGIN {OFS=" ";} {print $1, $4, $5, $6}' | rev | column -t | rev > xyz.xyz
#To add some text as the first, second and last lines.
        sed -i '1i -1 2' xyz.xyz
        sed -i '1i $molecule' xyz.xyz
        echo '$end' >> xyz.xyz
#To combine the extracted info with another file (ea_input.in)
        cat xyz.xyz ./input_ea.in > "${f/abc.out/pqr.in}"
     done

./script.sh: line 4: (ls file*.out | cut -d_ -f6: syntax error: invalid arithmetic operator (error token is ".out) | cut -d_ -f6")

How can I correct this error?

3 Comments

  • The immediate problem is too many parentheses; but really, don't use ls in scripts. Commented Aug 25, 2018 at 12:32
  • If you are using Awk anyway, refactor all of this to Awk. Running sed -i (twice!) on a file you just created with Awk is easily avoidable, and frankly quite horrible. Commented Aug 25, 2018 at 12:35
  • Thanks for the update. Could you please edit to also include the expected output for the sample input? Commented Aug 26, 2018 at 11:10

2 Answers

In bash, when you use:

$(( ... ))

it treats the contents of the parentheses as an arithmetic expression and returns the result of the calculation, whereas when you use:

$( ... )

it executes the contents of the parentheses as a command and returns its output.

So, to fix your issue, it should be as simple as replacing line 4 with:

n=$(ls "$f" | cut -d_ -f6)

This replaces the outer double parentheses with single ones and removes the extra parentheses around ls "$f", which are unnecessary. The double quotes around $f keep filenames containing spaces or wildcard characters intact.
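To make the difference concrete, here is a minimal sketch of the two constructs side by side. The filename is made up for illustration; I am only assuming that your real names carry the number in the sixth _-separated field.

```shell
#!/usr/bin/env bash
# Hypothetical filename, used only for illustration.
f="run_a_b_c_d_12_abc.out"

# Command substitution: run the pipeline and capture what it prints.
n=$(echo "$f" | cut -d_ -f6)

# Arithmetic expansion: evaluate the expression as integer arithmetic.
t=$((2 * n))

echo "n=$n t=$t"
```

Inside $(( )) the shell never runs commands, which is why putting ls in there produced a syntax error rather than a file listing.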

12 Comments

Thanks. I changed line 4 as you suggested. Now a new error appears: ./script.sh: line 5: 2 * 08: value too great for base (error token is "08")
A leading zero on a number in arithmetic context causes the shell to treat it as octal. Trimming leading zeros is easy per se.
How can I trim the leading zeros in the script?
I added n=$(ls $f | cut -d_ -f6 | sed 's/^0*//'). Now another error appears: ./script.sh: line 5: 2 * : syntax error: operand expected (error token is "* ")
My guess is that some of the *.out files don't have an appropriate number in the 6th field, try echo ${n} before line 5 to debug it.
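
For the two errors discussed in these comments, a possible sketch follows. It forces base 10 with Bash's 10# prefix instead of trimming zeros with sed, and checks that the field is numeric before doing arithmetic. The function name double_field and the sample values are hypothetical, not part of the original script.

```shell
#!/usr/bin/env bash
n="08"   # hypothetical value, as it might come out of a filename

# A plain $((2 * n)) fails here: the leading 0 means octal and 8 is not an octal digit.
# The 10# prefix makes Bash read the digits as base 10 regardless of leading zeros.
t=$((2 * 10#$n))
echo "$t"

# Guard against an empty or non-numeric field before doing arithmetic.
double_field() {
    local v=$1
    if [[ $v =~ ^[0-9]+$ ]]; then
        echo $((2 * 10#$v))
    else
        echo "not a number: '$v'" >&2
        return 1
    fi
}
```

Unlike sed 's/^0*//', this also copes with a field that is all zeros, since nothing is stripped before the arithmetic.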

The arithmetic error can be avoided by adding spaces between parentheses. You are already using var=$((arithmetic expression)) correctly elsewhere in your script, so it should be easy to see why $( ((ls "$f") | cut -d_ -f6)) needs a space. But the subshells are completely superfluous too; you want $(ls "$f" | cut -d_ -f6). Except ls isn't doing anything useful here, either; use $(echo "$f" | cut -d_ -f6). Except the shell can easily, albeit somewhat clumsily, extract a substring with parameter substitution; "${f#*_*_*_*_*_}" removes the first five _-separated fields (a second expansion with %%_* then trims whatever follows the sixth). Except if you're using Awk in your script anyway, it makes more sense to do this - and much more - in Awk as well.
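
As a rough illustration of the parameter substitution route, assuming a made-up filename with the number in the sixth field (the variable name rest is mine):

```shell
#!/usr/bin/env bash
f="run_a_b_c_d_08_abc.out"   # hypothetical filename

# Remove the shortest prefix that contains five underscores: leaves "08_abc.out".
rest=${f#*_*_*_*_*_}

# Remove the longest suffix starting at the first remaining underscore: leaves "08".
n=${rest%%_*}

echo "$n"
```

No external process is started here, which matters mostly when the loop runs over many files.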

Here is an attempt at refactoring most of the processing into Awk.

for f in *.out; do
     awk 'BEGIN {OFS=" " }
            # Extract 6th _-separated field from input filename
            FNR==1 { split(FILENAME, f, "_"); t=2*f[6] }
            # If input matches regex, add to array b
            /  B  / { b[++i] = $1 OFS $4 OFS $5 OFS $6 }
            # If array size reaches t, start overwriting old values
            i==t { i=0; m=t }
            END {
                # Print two prefix lines
                print "$molecule"; print -1, 2;
                # Handle array smaller than t
                if (!m) m=i
                # Print starting from oldest values (index i + 1) 
                for(j=1; j<=m; j++) {
                    # Wrap to beginning of array at end
                    if(i+j > t) i-=t
                    print b[i+j]; }
                print "$end" }' "$f" |
        rev | column -t | rev |
        cat -  ./input_ea.in > "${f/foo.out/bar.in}"
     done

Notice also how we avoid using a temporary file (this would certainly have been avoidable without the Awk refactoring, too) and how we take care to quote all filename variables in double quotes.

The array b contains (up to) the latest t values from matching lines; we collect these into an array which is constrained to never contain more than t values by wrapping the index i back to the beginning of the array when we reach index t. This "circular array" avoids keeping too many values in memory, which would make the script slow if the input file contains many matches.
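
If the circular array is hard to follow inside the larger script, here is a stripped-down sketch of the same idea on its own. It is not the code above: it uses a modulo index rather than resetting i, and the function name keep_last is just something I made up for the demonstration.

```shell
#!/usr/bin/env bash
# keep_last N: print only the last N lines of standard input,
# storing at most N lines in a circular Awk array.
keep_last() {
    awk -v t="$1" '
        # Slot index wraps around every t lines, overwriting the oldest entry.
        t > 0 { buf[NR % t] = $0 }
        END {
            if (t <= 0) exit
            # Oldest surviving line number; line 1 if fewer than t lines were read.
            start = (NR > t) ? NR - t + 1 : 1
            for (k = start; k <= NR; k++) print buf[k % t]
        }'
}

printf '%s\n' {1..10} | keep_last 3
```

This behaves like tail -n 3 for the sample input; the point is only to show how the index wraps so that memory use stays bounded by t.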

8 Comments

I find it difficult to understand what is going on inside awk. I am now running the script (with set -x) on 100 files, and it takes a long time.
For all the files, -1 2 is added to the input_ea.in file. The text extracted with / B / etc. is not added. Also, the variables $molecule and $end are not added to input_ea.in.
I had to guess some things and obviously have no way to test this. Are molecule and end the names of Bash variables, or static strings which should be included verbatim?
set -x only affects the verbosity of Bash, not Awk. You can examine what's going on in the script by adding print statements at various points in the script.
It's still not clear if you have an undocumented pair of variables or if you want the literal text $molecule and $end around the excerpted values. I have changed the script to do the latter.