Running shell script in parallel

Question

I have a shell script which

shuffles a large text file (6 million rows and 6 columns)
sorts the file based on the first column
outputs 1000 files

So the pseudocode looks like this

file1.sh 

#!/bin/bash
for i in $(seq 1 1000)
do

  Generating random numbers here , sorting  and outputting to file$i.txt  

done

Is there a way to run this shell script in parallel to make full use of multi-core CPUs?

At the moment, ./file1.sh executes runs 1 to 1000 in sequence, and it is very slow.

Thanks for your help.

If you find yourself needing to do anything non-trivial (e.g. multiprocessing) in a shell script, it's time to rewrite it in a proper programming language.
– Noufal Ibrahim, Commented Apr 5, 2011 at 6:17

Jonathan Dursi · Accepted Answer · 2013-06-20 16:26:03Z

97

Another very handy way to do this is with GNU parallel, which is well worth installing if you don't already have it. It is especially valuable when the tasks don't all take the same amount of time.

seq 1000 | parallel -j 8 --workdir "$PWD" ./myrun {}

will launch ./myrun 1, ./myrun 2, etc., keeping 8 jobs running at a time. It can also take lists of nodes if you want to run on several nodes at once, e.g. in a PBS job; our instructions to our users for how to do that on our system are here.
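
The answer does not show ./myrun itself. As a rough sketch only: the per-task script would take the task index as its first argument and perform exactly one iteration of the original loop, with no "&" or "wait" inside. The function name run_task, the use of $RANDOM, and the count of 100 numbers are illustrative assumptions, not the asker's actual shuffle and sort logic.

```shell
#!/bin/bash
# Hypothetical stand-in for ./myrun: performs ONE iteration of the original loop.
# The real shuffle/sort code from the question is not shown, so this simply
# writes 100 random numbers, sorted numerically, to file<i>.txt.
outdir=$(mktemp -d)   # scratch directory so the sketch leaves no clutter

run_task() {
    local i=$1
    local n
    for n in $(seq 1 100); do
        echo "$RANDOM"
    done | sort -n > "$outdir/file$i.txt"
}

# Run a few tasks directly. With GNU parallel installed, a script containing
# only the body of run_task could instead be launched as:
#   seq 1000 | parallel -j 8 ./myrun {}
for i in 1 2 3; do
    run_task "$i"
done
```

Because each invocation handles a single index, GNU parallel is free to start the next index as soon as any of the 8 slots becomes free.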

Updated to add: Make sure you're using GNU parallel, not the more limited utility of the same name that comes in the moreutils package (the divergent history of the two is described at gnu.org/software/parallel/history.html).

edited Jun 20, 2013 at 16:26

answered Apr 5, 2011 at 12:22

Jonathan Dursi


10 Comments

Tony Over a year ago

@Jonathan- Thanks for the pointer. I will ask my system administrator to install GNU parallel. It seems a useful utility to have on the system. Actually I was going to post the question on PBS, but you have already answered it. Cheers

Ole Tange Over a year ago

If your sysadmin will not install it, it is easy to install yourself: simply copy the Perl script 'parallel' to a directory in your path and you are done. No compilation or installation of libraries is needed.

Tony Over a year ago

@Ole - Thanks for the tip. My sysadmin has agreed to install it on the system.

Tony Over a year ago

@Jonathan- When you refer to ./myrun, is it the modified script with "&" and "wait" or without them, that is the original shell script? Cheers

Jonathan Dursi Over a year ago

It turns out the moreutils package includes not gnu-parallel but Tollef's; the history of the evolution of the tools is at gnu.org/software/parallel/history.html


Anders Lindahl · Accepted Answer · 2011-04-05 08:13:11Z

46

Check out bash subshells; these can be used to run parts of a script in parallel.

I haven't tested this, but this could be a start:

#!/bin/bash
for i in $(seq 1 1000)
do
   ( Generating random numbers here , sorting  and outputting to file$i.txt ) &
   if (( $i % 10 == 0 )); then wait; fi # Limit to 10 concurrent subshells.
done
wait

edited Apr 5, 2011 at 8:13

answered Apr 5, 2011 at 6:02

Anders Lindahl


10 Comments

Tony Delroy Over a year ago

That will kick off all the thousand tasks in parallel, which might lead to too much swapping / contention for optimal work throughput, but it's certainly a reasonable and easy way to get started.

Anders Lindahl Over a year ago

Good point! The simplest solution would be to have an outer loop that limits the number of started subshells and wait between them.

Tony Delroy Over a year ago

@Anders: or just slip an "if (( $i % 10 == 0 )); then wait; fi" before the "done" in your loop above...

Anders Lindahl Over a year ago

@Tony: I think it makes sense to leave it in. wait with no subshells running seems to do nothing, and if we choose a number of concurrent subshells that isn't a factor of the number of tasks to run, we might have active subshells still running when the loop ends.

Ole Tange Over a year ago

This solution works best if all the jobs take exactly the same time. If the jobs do not take the same time you will waste CPU time waiting for one of the long jobs to finish. In other words: It will not keep 10 jobs running at the same time at all times.


Tony Delroy · Accepted Answer · 2011-04-05 05:58:48Z

17

To make things run in parallel you use '&' at the end of a shell command to run it in the background, then wait will by default (i.e. without arguments) wait until all background processes are finished. So, maybe kick off 10 in parallel, then wait, then do another ten. You can do this easily with two nested loops.
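
A minimal sketch of the two nested loops described above. The totals (20 tasks, batches of 5) and the placeholder task function are assumptions chosen for illustration; the real work from the question would replace the body of task.

```shell
#!/bin/bash
# Assumed values for illustration: 20 tasks run in batches of 5.
TOTAL=20
BATCH=5
outdir=$(mktemp -d)

# Placeholder task; the real generate/sort/output step would go here.
task() {
    echo "task $1 done" > "$outdir/file$1.txt"
}

# Outer loop: one pass per batch.
for ((start = 1; start <= TOTAL; start += BATCH)); do
    # Inner loop: launch every task of this batch in the background.
    for ((i = start; i < start + BATCH && i <= TOTAL; i++)); do
        task "$i" &
    done
    # Block until all background jobs of this batch have finished.
    wait
done
```

Each batch lasts as long as its slowest task, so this works best when the tasks take roughly the same time.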

answered Apr 5, 2011 at 5:58

Tony Delroy


6 Comments

Tony Over a year ago

Many thanks for your suggestions. All CPUs are now working. Do you have any idea how to make it run across the nodes? I am submitting the job to High Performance Computing using PBS with nodes=2:ppn=8, but only 1 node is working.

Tony Delroy Over a year ago

@Tony: I'd never heard of PBS until now... sounds interesting, but I've no idea how to use it. Sorry!

Jonathan Dursi Over a year ago

For the PBS question and across nodes, see stackoverflow.com/questions/5453427/… .

d-b Over a year ago

How does WAIT work? Can you update your answer with an example? I want to run several threads in a certain function but the next function must not start until all these threads are finished.

Tony Delroy Over a year ago

@d-b wait waits for background processes to finish, not threads. For example, for FILE in huge.txt massive.log enormous.xml; do scp "$FILE" someuser@somehost:/tmp/ & done; wait; echo "finished" would run three scp (secure copy) processes to copy three files in parallel to a remote host's /tmp directory, and only output "finished" after all three copies were completed.


Eric O. Lebigot · Accepted Answer · 2011-09-21 18:54:16Z

9

The documentation for GNU parallel contains a whole list of programs that can run jobs in parallel from a shell, and even includes comparisons between them. There are many, many solutions out there. More good news is that they are probably quite efficient at scheduling jobs so that all the cores/processors are kept busy at all times.
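
As one illustration of such an alternative, xargs (which the GNU parallel documentation compares itself against) accepts a -P option in its GNU and BSD implementations to cap the number of simultaneous processes. The following is only a sketch: the task (writing the square of the index to a file) and the limit of 4 are assumptions for demonstration.

```shell
#!/bin/bash
# Hypothetical task: write the square of each index to a per-task file.
outdir=$(mktemp -d)
export outdir   # make the directory visible to the child shells

# -P 4   : run up to 4 processes at once (GNU and BSD xargs; not POSIX)
# -I {}  : run one command per input line, substituting the line for {}
# The index is passed as a positional argument ($1) instead of being
# pasted into the command string.
seq 1 20 | xargs -P 4 -I {} sh -c 'echo $(( $1 * $1 )) > "$outdir/file$1.txt"' _ {}
```

Like GNU parallel, this starts a new process as soon as one of the running ones exits, rather than waiting for a whole batch.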

answered Sep 21, 2011 at 18:54

Eric O. Lebigot


Andiamo Va · Accepted Answer · 2017-02-25 00:19:10Z

4

There is a simple, portable program that does just this for you: PPSS. PPSS automatically schedules jobs for you by checking how many cores are available and launching a new job each time a running one finishes.

edited Feb 25, 2017 at 0:19

Andiamo Va


answered Sep 21, 2011 at 18:47

Eric O. Lebigot


Robert J · Accepted Answer · 2024-05-10 06:38:10Z

1

While the previous answers do work, IMO they can be hard to remember (except of course GNU parallel).

I am somewhat partial to a similar approach to the above (( $i % 10 == 0 )) && wait. I have also seen this written as ((i=i%N)); ((i++==0)) && wait

where: N is defined as the number of jobs that you want to run in parallel and i is the current job.

While the above approach works, it wastes CPU time: you have to wait for every process in a batch to quit before a new set of processes can start, so cores sit idle whenever tasks take different amounts of time (in practice, almost always). In other words, with the previously described approach the number of running tasks must drop to 0 before new tasks are started.
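
To make the cost concrete, here is a small arithmetic sketch that compares the two scheduling strategies without running any real jobs. The durations and the worker count are made-up values, not measurements from this answer.

```shell
#!/bin/bash
# Hypothetical job durations in seconds (illustrative only).
durations=(5 1 1 1 1 1 1 1)
N=2   # number of concurrent workers

# Batch strategy: start N jobs, wait for all of them, repeat.
# Each batch lasts as long as its slowest job.
batch_total=0
for ((start = 0; start < ${#durations[@]}; start += N)); do
    longest=0
    for d in "${durations[@]:start:N}"; do
        if (( d > longest )); then longest=$d; fi
    done
    batch_total=$(( batch_total + longest ))
done

# Pool strategy: each job starts on whichever worker becomes free first.
finish=()
for ((w = 0; w < N; w++)); do finish[w]=0; done
for d in "${durations[@]}"; do
    best=0
    for ((w = 1; w < N; w++)); do
        if (( finish[w] < finish[best] )); then best=$w; fi
    done
    finish[best]=$(( finish[best] + d ))
done
pool_total=0
for t in "${finish[@]}"; do
    if (( t > pool_total )); then pool_total=$t; fi
done

echo "batch: ${batch_total}s, pool: ${pool_total}s"   # prints "batch: 8s, pool: 6s"
```

With these numbers the single 5-second job holds up only its own batch, yet the batch strategy still finishes 2 seconds later than the pool; the gap grows as durations become more uneven.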

For me, this issue became apparent when executing a task with an inconsistent execution time (e.g. executing a request to purge user information from a database: the requestee might or might not exist, and if they do exist there could be orders of magnitude of difference in the number of records associated with different requestees). What I noticed was that some requests would be fulfilled immediately, while others would sit in the queue waiting for one slightly longer-running task to finish. As a result, a task that would take hours or days to complete with the previously described approach took only tens of minutes.

I think that the below approach is a better solution for maintaining a constant task loading on systems without GNU parallel (e.g. vanilla macOS) and hopefully easier to remember than the above alphabet soup:

WORKER_LIMIT=6 # or whatever - remember to not bog down your system

while read -r LINE; do # this could be any kind of loop
    # there's probably a more elegant approach to getting the number of background processes.
    BACKGROUND_PROCESSES="$(jobs -r | wc -l | tr -d ' ')"

    if [[ $BACKGROUND_PROCESSES -ge $WORKER_LIMIT ]]; then
        # wait for 1 job to finish before starting a new one
        # (wait -n requires bash 4.3 or newer)
        wait -n
    fi

    # run something in a background shell
    python example.py -item "$LINE" &
done < something.list

# wait for all background jobs to finish
wait
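
On bash versions older than 4.3, where wait -n is not available, one possible fallback is to poll the job count until a slot frees up. The sketch below assumes a placeholder task function and a limit of 3 workers; both are illustrative and stand in for the python command above.

```shell
#!/bin/bash
WORKER_LIMIT=3            # assumed value for illustration
outdir=$(mktemp -d)

# Placeholder task standing in for "python example.py -item ...".
task() {
    sleep 0.2
    echo "$1" > "$outdir/$1.done"
}

for item in a b c d e f g h; do
    # Poll until fewer than WORKER_LIMIT jobs are running.
    # This replaces "wait -n" at the cost of a short sleep per check.
    while [[ $(jobs -r | wc -l) -ge $WORKER_LIMIT ]]; do
        sleep 0.1
    done
    task "$item" &
done

# wait for all background jobs to finish
wait
```

The polling interval is a trade-off: a shorter sleep fills free slots sooner but spends more time checking.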

edited May 10, 2024 at 6:38

answered Apr 16, 2022 at 22:48

Robert J


5 Comments

mgutt Over a year ago

What is the purpose of grep? wc -l already returns a single number I think. I would use if [[ $(jobs | wc -l) -ge $WORKER_LIMIT ]]; then instead. And what I suggest in addition is to add trap "jobs -p | xargs kill 2>/dev/null" EXIT at the top of the script which allows CTRL+C to kill all background jobs (or they would run until they are finished).

mgutt Over a year ago

Use this if bash does not support -n: while [[ $(jobs | wc -l) -ge $WORKER_LIMIT ]]; do sleep 1; done

Robert J Over a year ago

@mgutt It looks like I was using the grep to remove whitespace from the output of wc -l. This can be demonstrated by using wc -c to count the number of characters: BACKGROUND_PROCESSES="$(jobs -r | wc -l)"; echo "$BACKGROUND_PROCESSES" | wc -c. Looking back on this, I think tr -d ' ' would have been clearer.

mgutt Over a year ago

But wc -l does not return any whitespace?!

Robert J Over a year ago

It does for me. When I execute the above on macOS and get the character count with wc -c, I get 9, whereas with tr -d ' ' I get 2 (one being the line return).

Zakaria · Accepted Answer · 2018-03-21 15:47:24Z

0

IDLE_CPU=1
NCPU=$(nproc)

int_childs() {
    trap - INT
    while IFS=$'\n' read -r pid; do
        kill -s SIGINT -$pid
    done < <(jobs -p -r)
    kill -s SIGINT -$$
}

# cmds is an array that holds the commands to run
# the complex part is display, which handles all command output
# and serializes it correctly

trap int_childs INT
{
    exec 2>&1
    set -m

    if [ $NCPU -gt $IDLE_CPU ]; then
        for cmd in "${cmds[@]}"; do
            $cmd &
            while [ $(jobs -pr |wc -l) -ge $((NCPU - IDLE_CPU)) ]; do
                wait -n
            done
        done
        wait

    else
        for cmd in "${cmds[@]}"; do
            $cmd
        done
    fi
} | display
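
The script above relies on a cmds array and a display function that the answer does not define. Purely as a hypothetical sketch of what might precede it, the definitions below use made-up commands and a trivial display that tags each line of output; the real display is described only as handling and serializing the output.

```shell
#!/bin/bash
# Hypothetical command list. Each entry is expanded unquoted as $cmd in the
# script above, so it must be a simple command without quoting or redirections.
cmds=(
    "sleep 1"
    "echo hello"
    "uname -s"
)

# Hypothetical display: reads the combined output line by line and
# prefixes each line, so lines are emitted whole, one at a time.
display() {
    local line
    while IFS= read -r line; do
        printf '[out] %s\n' "$line"
    done
}
```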

answered Mar 21, 2018 at 15:47

Zakaria


jreisinger · Accepted Answer · 2020-01-04 22:25:49Z

0

You might want to take a look at runp. runp is a simple command-line tool that runs (shell) commands in parallel. It's useful when you want to run multiple commands at once to save time. It's easy to install since it's a single binary. It's been tested on Linux (amd64 and arm) and macOS/darwin (amd64).

answered Jan 4, 2020 at 22:25

jreisinger


Bash Coder · Accepted Answer · 2012-07-04 19:54:00Z

-2

Generating random numbers is easy. Suppose you have a huge file, such as a shop database, and you want to rewrite that file on some specific basis. My idea was to count the number of cores, split the file into that many parts, and create three files: script.cfg, split.sh and recombine.sh.

split.sh splits the file into one part per core, clones script.cfg (the script that changes things in the huge file) once per core, makes the clones executable, replaces the variables in each clone that tell it which part of the file to process, and runs the clones in the background. When a clone is done, it generates a clone$core.ok file. A loop then recombines the partial results into a single file, but only once all the .ok files have been generated. It can be done with "wait", but I prefer my way.

See the bottom of http://www.linux-romania.com/product.php?id_product=76 (partially translated into English). In this way I can process 20000 articles with 16 columns in 2 minutes (quad core) instead of 8 (single core). You have to keep an eye on CPU temperature, because all cores are running at 100%.

answered Jul 4, 2012 at 19:54

Bash Coder


2 Comments

t0mm13b Over a year ago

Qué habla Inglés? Please refrain from text speak, u, coz, ... You certainly typed out other words fine but not the little words - clear laziness obviously!

jrw32982 Over a year ago

Plus the link is broken.
