
I’m trying to parallelize the following script:

$ awk -F , '$3 > 25 && $3 < 26' data_temp.csv | head

... for which I am getting the desired output. (Same for cat data_temp.csv | awk -F , '$3 > 25 && $3 < 26' | head.) My attempts so far:

$ parallel "awk -F , '$3 > 25 && $3 < 26' data_temp.csv" | head
parallel: Warning: Input is read from the terminal.
parallel: Warning: Only experts do this on purpose. Press CTRL-D to exit.

$ cat data_temp.csv | parallel --pipe awk -F , \'$3 > 25 && $3 < 26\' | awk -F , '$3 > 25 && $3 < 26' | head
sh: -c: line 0: unexpected EOF while looking for matching `''
sh: -c: line 1: syntax error: unexpected end of file
# repeated for what looks like every line

1 Answer


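Untested. First, streaming the file through a pipe:

cat data_temp.csv |
  parallel -k -q --block 100M --pipe awk -F , '$3 > 25 && $3 < 26' |
  head

Second, letting GNU Parallel read the file directly with --pipepart:

parallel -k -q --block 100M --pipepart -a data_temp.csv awk -F , '$3 > 25 && $3 < 26' |
  head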
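Part of what goes wrong in the question's attempts appears to be quoting: inside double quotes the shell itself expands $3 (usually to an empty string) before parallel or awk ever sees it, whereas -q asks GNU Parallel to quote the command so the single-quoted awk program arrives intact. A minimal sketch of the difference, assuming a made-up three-line sample file and not using parallel at all:

```shell
# Hypothetical sample data, only for illustration.
tmp=$(mktemp)
printf '%s\n' 'a,1,24.5' 'b,2,25.5' 'c,3,26.5' > "$tmp"

# Single quotes: awk receives the program text unchanged and filters on column 3.
filtered=$(awk -F , '$3 > 25 && $3 < 26' "$tmp")
printf 'single-quoted filter matched: %s\n' "$filtered"

# Double quotes: the shell substitutes its own positional parameter $3
# (cleared here with "set --") before the command string is handed on.
set --
expanded="awk -F , '$3 > 25 && $3 < 26' data_temp.csv"
printf 'double-quoted string becomes: %s\n' "$expanded"

rm -f "$tmp"
```

The second string is no longer a valid awk filter, which is why the commands above keep the single quotes and add -q.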

4 Comments

If you are CPU-bound, the second command (--pipepart) will be faster. If you are disk I/O-bound, only your own measurements will tell.
Does it spread the tasks equally among all available cores, or do I have to specify how many cores to use? Also, in another thread (on general advice on what to do for this type of task), someone suggested dividing the file into several files, and then operating on those in parallel with awk. Is there a difference between that approach and the code you wrote (i.e., does your code do that automatically)? I can divide the files offline (before the operations). Also, if my file is 500 MB, and I have eight cores, does it make sense to use 60 MB rather than 100 MB for the block?
GNU Parallel spreads it equally. It is a good idea to use a block size small enough that you get at least 10 blocks per core. E.g. 80 GB on 8 cores should use at most --block 1G, and for an 800M file on 8 cores a good value is --block 10M. --pipepart is so efficient that you will only waste time if you split the input into smaller files, so do not do that. In general, though, I encourage you to try different values and measure: computers are complex systems and YMMV.
Thanks! Apparently I have 16 cores on my company’s server, so the munging is flying.
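The block-size rule of thumb from the comments (at least 10 blocks per core) can be turned into a quick calculation. This is only a sketch: the variable names are made up for illustration, the sizes are the example figures from the comment, and the result is a starting point for measurement, not a recommendation from the GNU Parallel documentation.

```shell
# Example figures from the comment: an 800M file on 8 cores.
file_size_mb=800
cores=8
blocks_per_core=10   # rule of thumb: at least 10 blocks per core

# Largest block size (in MB) that still yields the desired number of blocks.
block_mb=$(( file_size_mb / (cores * blocks_per_core) ))

# Guard against tiny files rounding down to a zero-sized block.
if [ "$block_mb" -lt 1 ]; then
  block_mb=1
fi

echo "--block ${block_mb}M"
```

With the 500 MB file and eight cores mentioned in the question, the same arithmetic gives 6, i.e. roughly --block 6M as an upper bound to start measuring from.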
