
I have been having issues summing a very big array (millions of numbers): I am trying to sum all the values in it, but it keeps failing (the second script gets 0 from the first one). Below is my code:

Map.sh

#/bin/bash

file="myfile.csv"
data=`tail -n +2 $file |  cut -d"," -f 4`
data1=()
for i in $data;
do
data1+=($i)
done;
count=${#data1[@]}
export count
export data1
export data
./reduce.sh

reduce.sh

#/bin/bash
echo $count
sum=0
for i in "${data1[@]}"; do
        sum = $((sum + $i))
done;
echo $sum

I have tried almost every variant I have found online, but none works. Am I missing something?

Data example: I am looking at column 4 (a screenshot of the column was attached in the original post), and it extends for millions of rows.

  • Post an input example.
  • @Daniel: Did you verify the content of your data1 array? BTW, what is the purpose of turning your shell variables into environment variables? Aside from the fact that a bash array cannot be exported, you don't have any child process that would benefit from the export.
  • I do get this message when I try to get the count in the reduce script: Argument list too long. So I guess that is the issue. Can you think of any solutions?
  • Side note: sum = $((sum + $i)) is wrong (blanks around =); shellcheck.net tells you things like that.
  • Also protect base-10 against base-8 interpretation if values contain leading zeros: sum=$((sum + 10#$i)), or with bash arithmetic: ((sum += 10#$i)). Anyway, iterating over a large data set with a shell loop is not appropriate; try read -r sum < <(IFS='+'; printf '%s\n' "${data1[*]}" | bc -l) or read -r sum < <(tail -n +2 "$file" | cut -d ',' -f 4 | paste -sd+ - | bc -l). (A corrected version of both scripts is sketched below.)
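
Putting those comments together, here is a minimal sketch of a working version of the two scripts. This is a hypothetical rewrite, not the original poster's code: it streams the column through a pipe instead of exporting it (a bash array cannot be exported, and a multi-megabyte environment is exactly what triggers "Argument list too long"), and it fixes the shebang (#!/bin/bash, not #/bin/bash) and the assignment spacing flagged above.

Map.sh

#!/bin/bash
# Emit the 4th CSV column on stdout and pipe it to the reducer,
# instead of passing it through the environment.
file="myfile.csv"
tail -n +2 "$file" | cut -d',' -f4 | ./reduce.sh

reduce.sh

#!/bin/bash
# Read one value per line from stdin and sum them.
# Assumes every line holds a non-negative integer; 10#$i forces
# base 10 so values with leading zeros are not read as octal.
sum=0
count=0
while IFS= read -r i; do
    sum=$((sum + 10#$i))
    count=$((count + 1))
done
echo "$count"
echo "$sum"

This keeps the map/reduce split, but a pure-bash loop over millions of lines will still be slow; the answers below avoid the shell loop entirely.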

2 Answers


With GNU datamash:

datamash --header-in -t',' sum 4 < myfile.csv

This sums the values of the fourth field of the comma-separated input file; --header-in skips the header line.
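
For example, with a small hypothetical stand-in for myfile.csv (the column names are made up; only the shape matters):

$ cat myfile.csv
id,name,date,value
1,a,2020-03-01,10
2,b,2020-03-02,20
3,c,2020-03-03,12
$ datamash --header-in -t',' sum 4 < myfile.csv
42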




Would this awk work for you:

$ awk -F, '       # comma delimiter
FNR>1 {           # skip header record
    sum+=$4       # sum 4th field values to sum var
}
END {             # in the end
    print sum     # output the sum
}' file
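
For reference, the same program as a one-liner, using the file name from the question (myfile.csv):

awk -F',' 'FNR>1 { sum += $4 } END { print sum }' myfile.csv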

Comments

  • Maybe add skipping of the header line.
  • ./reduce.sh: Argument list too long. Plus I have to use the array data1; this follows the MapReduce structure, where the key and the set of numbers are extracted and stored in the first script, and then carried over and summed in the second one.
  • @Daniel: If you need the count as well as the sum, change print sum to print NR-1, sum (see the sketch after these comments). Or, if you need them on separate lines, use two print statements. JamesBrown's awk script replaces your reduce.sh.
  • The limitation of awk is that it uses double-precision floating-point arithmetic, so the results are only exact up to 2**53. I think you should be OK here, since it seems like your sum is in the millions or maybe billions, but not yet quadrillions. But it is a serious limitation in a MapReduce environment.
  • As long as you are summing small whole numbers, you shouldn't have any rounding issues, and a benefit of the awk solution is that it will be orders of magnitude faster than a shell-script solution.
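
Following the comment about counting, a small sketch of the count-and-sum variant (NR-1 excludes the header line from the count; myfile.csv as in the question):

awk -F',' 'FNR>1 { sum += $4 } END { print NR-1, sum }' myfile.csv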
