I have read similar questions about this topic but none of them help me with the following problem:

I have a bash script that looks like this:

#!/bin/bash

for filename in /home/user/Desktop/emak/*.fa; do
    mkdir ${filename%.*}
    cd ${filename%.*}
    mkdir emak
    cd ..
done

This script basically does the following:

  • Iterate through all files in a directory
  • Create a new directory with the name of each file
  • Go inside the new directory and create a new directory called "emak"

The real task does something much more computationally expensive than creating the "emak" directory...

I have thousands of files to iterate through. As each iteration is independent of the previous one, I would like to split the work across different processors (I have 24 cores) so I can process multiple files at the same time.

I have read some previous posts about running in parallel (using GNU Parallel), but I do not see a clear way to apply it in this case.

thanks

  • Have you made an attempt yourself using GNU parallel? It would be good to see that. Commented Nov 24, 2015 at 15:00
  • parallel -j $(($(getconf _NPROCESSORS_ONLN) - 1)) <your script name> Commented Nov 24, 2015 at 15:12
  • BTW, run your code through shellcheck.net to have quoting bugs found automatically so we don't need to point them out here. (If you had spaces in your filenames, the current code would behave badly). Commented Nov 24, 2015 at 17:05
  • @rai Default is number of cores. -j-1 == number of cores minus one. Commented Nov 24, 2015 at 17:45

2 Answers


No need for parallel; you can simply use

N=10
for filename in /home/user/Desktop/emak/*.fa; do
    mkdir -p "${filename%.*}/emak" &
    (( ++count % N == 0)) && wait
done

The (( ++count % N == 0 )) && wait line pauses after every Nth job is launched, allowing all of the jobs started so far to complete before continuing.
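If your bash is 4.3 or newer, you could also use wait -n to start a replacement job as soon as any single background job finishes, instead of stalling on whole batches of N. A rough sketch along the same lines:

N=10
count=0
for filename in /home/user/Desktop/emak/*.fa; do
    mkdir -p "${filename%.*}/emak" &
    # Once N jobs are in flight, wait for any one of them to finish (bash 4.3+)
    (( ++count < N )) || wait -n
done
wait   # let the remaining jobs finish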

6 Comments

Nice. Also, much more efficient than the GNU parallel approach of spinning up new shell instances.
...though really, reducing the number of individual mkdir calls would help performance even further. Perhaps one might want to pipe into xargs -0 -P 0 mkdir -p? That also avoids the wasted CPU from waiting for all N processes to finish whenever we get to a wait before starting a new batch.
I started working on something like find ... -exec mkdir -p {} +, but lost interest in figuring out how to combine that with stripping .fa from filename. Free rep for anyone who wants to pursue that! :)
-exec bash -c 'mkdir -p "${@%.*}"' {} +, perhaps?
That strips the .fa, but doesn't add /emak to each.
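One possible way to finish that thought, as a sketch (assuming GNU find and bash, with the .fa suffix hardcoded): let find hand batches of files to an inline bash that rewrites each name and creates everything with a single mkdir -p per batch.

# The substitution turns ".../name.fa" into ".../name/emak" for every argument in
# the batch, so one mkdir -p creates all of the directories at once.
find /home/user/Desktop/emak -maxdepth 1 -name '*.fa' \
    -exec bash -c 'mkdir -p "${@/%.fa//emak}"' _ {} +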

Something like this with GNU Parallel, whereby you create and export a bash function called doit:

#!/bin/bash

doit() {
    dir=${1%.*}
    mkdir "$dir"
    cd "$dir"
    mkdir emak
}
export -f doit
parallel doit ::: /home/user/Desktop/emak/*.fa

You will really see the benefit of this approach if the time taken by your "computationally expensive" part is longer or, especially, if it is variable. If it takes, say, up to 10 seconds and varies from file to file, GNU Parallel will submit the next job as soon as the shortest of the N parallel processes completes, rather than waiting for all N to complete before starting the next batch of N jobs.

As a crude benchmark, this takes 58 seconds:

#!/bin/bash

doit() {
   echo $1
   # Sleep up to 10 seconds
   sleep $((RANDOM*11/32768))
}
export -f doit
parallel -j 10 doit ::: {0..99}

and this is directly comparable and takes 87 seconds:

#!/bin/bash
N=10
for i in {0..99}; do
    echo $i
    sleep $((RANDOM*11/32768)) &
    (( ++count % N == 0)) && wait
done

5 Comments

Sure, though in this particular case I'd argue that the overhead of spinning up a new child-process shell to run each copy of this function per directory will be far, far more expensive than the time saved by the parallelization itself.
@CharlesDuffy OP says that the actual process is "far more computationally expensive"
sigh. I wish people would put a sleep 3 # do something expensive here in their examples to demonstrate that kind of thing.
This one works really well!! Each iteration takes 47 seconds. With the doit function it takes 50 seconds to do 24 iterations. I tried with 48 files (48 iterations) and it takes 100 seconds. It works in blocks of 24; it seems to me that it does so because I have 24 cores. Am I right? Thanks so much!
Correct! You can also use parallel --eta to get an estimate of when it will finish (Estimated Time of Arrival) and parallel -j 16 to run on 16 cores, for example. Also, if you have multiple servers available, you can distribute the jobs across multiple machines just by adding them to the command line - check any GNU Parallel tutorial. To be fair to @chepner you should change his N=10 to N=24.
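For reference, those options might look something like this (a sketch; server1 and server2 are placeholder host names):

# At most 16 jobs at a time, with a running estimate of the finish time
parallel --eta -j 16 doit ::: /home/user/Desktop/emak/*.fa

# Hypothetical spread across remote machines (the paths must also exist there,
# and --env ships the exported doit function along; ":" means the local machine):
#   parallel --env doit -S server1,server2,: doit ::: /home/user/Desktop/emak/*.fa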
