
I have a Perl script I wish to parallelise.

It is composed of a while loop over more than 11000 lines, nested inside another while loop over 3400 lines, which makes it extremely slow.

open (FILE1, "File1.txt") or die "Can't open File1";
open (OUT, ">Outfile.txt");

while (<FILE1>)
{ 
 my @data=split (/ /, $_);
 my $RS=1;
 open (FILE2, "File2.txt") or die "Can't open File2";
 while (<FILE2>)
 { 
  my @value=split (/ /, $_);
  if ($data[$RS] == 1) {print OUT $value[1];$RS++;}
  elsif ($data[$RS] == 2) {print OUT $value[2];$RS++;}
  elsif ($data[$RS] == 0) {print OUT $value[3];$RS++;}
 }
 close FILE2;

}

I'm looking for a way to do the equivalent of qsub with every line of File1 so I can submit 3440 jobs. Any suggestions? I'd like to stay with Perl if possible. I tried to put this code inside a bash script, but I don't really understand how to embed one language inside another.

My File1 contains a list of IDs with information in columns. Each column relates to a single line in File2. I'd like to be able to run the second loop for multiple IDs simultaneously instead of one after another.

File1
ID     RS_10    RS_15    RS_30
23     1        0        1
34     2        2        0
45     1        1        0
23     0        0        2
10     2        1        1


File2
RS_10 A B C
RS_15 D E F
RS_30 G H I
  • How something is best parallelized depends very much on your algorithm. Your code snippet does not show useful code (it won't run doing anything useful, won't compile under strict, and has bad style). Also, I don't know what qsub is. Could you elaborate a bit more on what you are actually trying to do, to avoid an XY problem? Maybe there is an algorithm for your task that is faster than the current O(n·m). Commented Jul 24, 2013 at 16:08
  • The problem is that your solution is O(N²). You could maybe load one of the files into a hash to make this O(N). Also, reading data from an in-memory hash will be faster than reading it from disk. Commented Jul 24, 2013 at 16:28
  • Do you really think that 3440 jobs fighting for a CPU core would perform well? Commented Jul 24, 2013 at 16:30
  • @amon I deliberately didn't write my conditions in the second loop because they are already working. This is just a simplified version of my code. Commented Jul 24, 2013 at 17:16
  • My point is that you don't necessarily want to parallelize your code. Actually, you want to speed up your script. Showing example data was very helpful. Is it correct that each line in File1 actually has 3441 columns? What will you do with your two tables, and what is the output? Have you considered using an SQL database? I have too little information to provide qualified suggestions. Commented Jul 24, 2013 at 17:28

1 Answer


The first rule of optimization is not to do it too early (i.e. don't jump to premature conclusions without profiling your code).

The second rule would probably refer to caching.

Your File2 isn't very large. I'd say we load it into memory. This has the following advantages:

  • We do our parsing once and only once.
  • The file isn't obscenely large, so space isn't much of an issue.
  • We can create a data structure that makes lookups very simple.

About that first point: you split each line of File2 over three thousand times. Those cycles could have been better spent.

About that third point: you seem to do an index conversion:

1 → 1, 2 → 2, 0 → 3

Instead of testing for all values with an if/elsif-switch (linear complexity), we could use an array that does this translation (constant time lookups):

my @conversion = (3, 1, 2);
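# maps a data value to its column in @value: 0 -> 3, 1 -> 1, 2 -> 2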
...;
print OUT $value[$conversion[$data[$RS++]]];

If this index conversion is constant, we could do it once and only once when parsing File2. This would look like

use strict; use warnings;
use autodie; # automatic error handling

my @file2;
{
  open my $file2, "<", "File2.txt";
  while (<$file2>) {
    my (undef, @vals) = split;

    # do the reordering. This is equivalent to @vals = @vals[2, 0, 1];
    unshift @vals, pop @vals;

    push @file2, \@vals;
  }
}
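
With the sample File2 above, @file2 ends up holding the reordered rows, so the data values from File1 can be used as indices directly. Worked out by hand from the sample data, the structure looks like this:

# @file2 after parsing and rotating the sample File2:
# (
#   [ 'C', 'A', 'B' ],   # RS_10: data value 0 -> C, 1 -> A, 2 -> B
#   [ 'F', 'D', 'E' ],   # RS_15: 0 -> F, 1 -> D, 2 -> E
#   [ 'I', 'G', 'H' ],   # RS_30: 0 -> I, 1 -> G, 2 -> H
# )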

Now we can move on to iterating through File1. Printing the corresponding entry from File2 now looks like

open my $file1, "<", "File1.txt";
<$file1>; # remove header
while (<$file1>) {
  my ($id, @indices) = split;
  print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
  # but I guess you'd want some separator in between
  # If so, set the $, variable
}
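
If you want the space-separated, one-record-per-line output shown below, set Perl's output separators once before the loop (this is what the $, hint in the comment above refers to):

$, = " ";   # output field separator, printed between print's arguments
$\ = "\n";  # output record separator, appended after every print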

This algorithm is still quadratic (the map is just a for-loop in disguise), but it should have a better constant factor. The output of the above code, given your example input, is

23 A F G
34 B E I
45 A D I
23 C F H
10 B D G

(with $, = " "; $\ = "\n").
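
To make the "for-loop in disguise" remark concrete, the print statement above could equivalently be spelled as an explicit loop. This is my own rewrite, producing the same output without relying on $, and $\:

my @fields = ($id);
for my $col (0 .. $#indices) {
    push @fields, $file2[$col][$indices[$col]];
}
print join(" ", @fields), "\n";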

Where to go from here

This last step (looping through File1) could be parallelized, but this is unlikely to help much: IO is slow, communication between threads is expensive (IPC even more so), and the output would be in random order. We could spawn a bunch of workers, and pass unparsed lines in a queue:

use threads; # should be 1st module to be loaded
use Thread::Queue;
use constant NUM_THREADS => 4; # number of cores

# parse the File2 data here

my $queue = Thread::Queue->new;
my @threads = map threads->new(\&worker), 1 .. NUM_THREADS;

# enqueue data
$queue->enqueue($_) while <$file1>;
# end the queue
$queue->enqueue((undef) x NUM_THREADS); # $queue->end in newer versions

# wait for threads to complete
$_->join for @threads;

sub worker {
  # use a named lexical here: "my $_" was experimental and was removed in Perl 5.24
  while (defined(my $line = $queue->dequeue)) {
    my ($id, @indices) = split ' ', $line;
    print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
  }
}

Note that this copies @file2 into every thread. Fun fact: for the example data, this threaded solution takes roughly 4× as long. This is mostly the overhead of thread creation, so it will be less of an issue with your real data.
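
If the random output order matters, one possible workaround is to tag each line with its number, push formatted results back through a second queue, and let the main thread reassemble them in order. This is a rough sketch of my own, not part of the original answer: it replaces the worker/enqueue section above and reuses $queue, @file2, and NUM_THREADS.

my $results = Thread::Queue->new;

sub ordered_worker {
  while (defined(my $job = $queue->dequeue)) {
    my ($n, $line) = split /\t/, $job, 2;  # recover line number and payload
    my ($id, @indices) = split ' ', $line;
    my $out = join " ", $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
    $results->enqueue("$n\t$out");
  }
  $results->enqueue(undef);  # signal: this worker is done
}

my @threads = map threads->new(\&ordered_worker), 1 .. NUM_THREADS;

# number each job; a tab is a safe delimiter since the data is space-separated
my $n = 0;
$queue->enqueue($n++ . "\t$_") while <$file1>;
$queue->enqueue((undef) x NUM_THREADS);

# print in input order, buffering results that arrive early
my %buffer;
my $next = 0;
my $done = 0;
while ($done < NUM_THREADS) {
  my $res = $results->dequeue;
  if (!defined $res) { $done++; next; }
  my ($k, $out) = split /\t/, $res, 2;
  $buffer{$k} = $out;
  print delete($buffer{$next++}), "\n" while exists $buffer{$next};
}
$_->join for @threads;

Because each worker enqueues its undef after all of its results, counting NUM_THREADS undefs guarantees every result has been dequeued and printed.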

In any case, profile your code to see where you can optimize most effectively. I recommend the excellent Devel::NYTProf. E.g. in my non-threaded test run with this very limited data, the overhead of autodie and friends took more time than the actual processing. For you, the most expensive line would probably be

  print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;

but there isn't much we can do here inside Perl.
