I have read similar questions about this topic but none of them help me with the following problem:

I have a bash script that looks like this:

#!/bin/bash

for filename in /home/user/Desktop/emak/*.fa; do
    mkdir ${filename%.*}
    cd ${filename%.*}
    mkdir emak
    cd ..
done

This script basically does the following:

  • Iterate through all files in a directory
  • Create a new directory with the name of each file
  • Go inside the new directory and create a new directory called "emak"

The real task does something much more computationally expensive than creating the "emak" directory...

I have thousands of files to iterate through. As each iteration is independent of the previous one, I would like to split the work across different processors (I have 24 cores) so I can process multiple files at the same time.

I have read some previous posts about running in parallel (using GNU Parallel), but I do not see a clear way to apply it in this case.

thanks

  • Have you made an attempt yourself using GNU parallel? It would be good to see that. Commented Nov 24, 2015 at 15:00
  • parallel -j $(($(getconf _NPROCESSORS_ONLN) - 1)) <your script name> Commented Nov 24, 2015 at 15:12
  • BTW, run your code through shellcheck.net to have quoting bugs found automatically so we don't need to point them out here. (If you had spaces in your filenames, the current code would behave badly). Commented Nov 24, 2015 at 17:05
  • @rai Default is number of cores. -j-1 == number of cores minus one. Commented Nov 24, 2015 at 17:45

2 Answers


No need for parallel; you can simply use

N=10
for filename in /home/user/Desktop/emak/*.fa; do
    mkdir -p "${filename%.*}/emak" &
    (( ++count % N == 0)) && wait
done

The (( ++count % N == 0 )) && wait line pauses after every Nth job is launched, allowing all of the jobs started so far to complete before continuing.
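If your bash is 4.3 or newer, you could also use wait -n to start a replacement job as soon as any single background job finishes, instead of stalling on whole batches of N. A rough sketch along the same lines:

N=10
count=0
for filename in /home/user/Desktop/emak/*.fa; do
    mkdir -p "${filename%.*}/emak" &
    # Once N jobs are in flight, wait for any one of them to finish (bash 4.3+)
    (( ++count < N )) || wait -n
done
wait   # let the remaining jobs finish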

6 Comments

Nice. Also, much more efficient than the GNU parallel approach of spinning up new shell instances.
...though really, reducing the number of individual mkdir calls would help performance even further. Perhaps one might want to pipe into xargs -0 -P 0 mkdir -p? That also avoids the wasted CPU from waiting for all N processes to finish whenever we get to a wait before starting a new batch.
I started working on something like find ... -exec mkdir -p {} +, but lost interest in figuring out how to combine that with stripping .fa from filename. Free rep for anyone who wants to pursue that! :)
-exec bash -c 'mkdir -p "${@%.*}"' {} +, perhaps?
That strips the .fa, but doesn't add /emak to each.
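One possible way to finish that thought, as a sketch (assuming GNU find and bash, with the .fa suffix hardcoded): let find hand batches of files to an inline bash that rewrites each name and creates everything with a single mkdir -p per batch.

# The substitution turns ".../name.fa" into ".../name/emak" for every argument in
# the batch, so one mkdir -p creates all of the directories at once.
find /home/user/Desktop/emak -maxdepth 1 -name '*.fa' \
    -exec bash -c 'mkdir -p "${@/%.fa//emak}"' _ {} +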

Something like this with GNU Parallel, whereby you create and export a bash function called doit:

#!/bin/bash

doit() {
    dir=${1%.*}
    mkdir "$dir"
    cd "$dir"
    mkdir emak
}
export -f doit
parallel doit ::: /home/user/Desktop/emak/*.fa

You will really see the benefit of this approach if the time taken by your "computationally expensive" part is longer or, especially, if it is variable. If it takes, say, up to 10 seconds and varies from file to file, GNU Parallel will submit the next job as soon as the shortest of the N parallel processes completes, rather than waiting for all N to complete before starting the next batch of N jobs.

As a crude benchmark, this takes 58 seconds:

#!/bin/bash

doit() {
   echo $1
   # Sleep up to 10 seconds
   sleep $((RANDOM*11/32768))
}
export -f doit
parallel -j 10 doit ::: {0..99}

and this is directly comparable and takes 87 seconds:

#!/bin/bash
N=10
for i in {0..99}; do
    echo $i
    sleep $((RANDOM*11/32768)) &
    (( ++count % N == 0)) && wait
done

5 Comments

Sure, though in this particular case I'd argue that the overhead of spinning up a new child-process shell to run each copy of this function per directory will be far, far more expensive than the time saved by the parallelization itself.
@CharlesDuffy OP says that the actual process is "far more computationally expensive"
sigh. I wish people would put a sleep 3 # do something expensive here in their examples to demonstrate that kind of thing.
This one works really well!! Each iteration takes 47 seconds. With the doit function it takes 50 seconds to do 24 iterations. I tried with 48 files (48 iterations) and it takes 100 seconds. It works in blocks of 24; it seems to me that it does so because I have 24 cores. Am I right? Thanks so much!
Correct! You can also use parallel --eta to get an estimate of when it will finish (Estimated Time of Arrival) and parallel -j 16 to run on 16 cores, for example. Also, if you have multiple servers available, you can distribute the jobs across multiple machines just by adding them to the command line - check any GNU Parallel tutorial. To be fair to @chepner you should change his N=10 to N=24.
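For reference, those options might look something like this (a sketch; server1 and server2 are placeholder host names):

# At most 16 jobs at a time, with a running estimate of the finish time
parallel --eta -j 16 doit ::: /home/user/Desktop/emak/*.fa

# Hypothetical spread across remote machines (the paths must also exist there,
# and --env ships the exported doit function along; ":" means the local machine):
#   parallel --env doit -S server1,server2,: doit ::: /home/user/Desktop/emak/*.fa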
