
I am a college freshman and Python newbie so bear with me. I am trying to parallelize some matrix operations. Here is my attempt using the ParallelPython module:

    def testfunc(connectionMatrix, qCount, iCount, Htry, tStepCount):
        test = connectionMatrix[0:qCount, 0:iCount].dot(Htry[tStepCount-1, 0:iCount])
        return test

    f1 = job_server.submit(testfunc, (self.connectionMatrix, self.qCount, self.iCount, self.iHtry, self.tStepCount), modules=("scipy.sparse",))
    f2 = job_server.submit(testfunc, (self.connectionMatrix, self.qCount, self.iCount, self.didtHtry, self.tStepCount), modules=("scipy.sparse",))
    r1 = f1()
    r2 = f2()
    self.qHtry[self.tStepCount, 0:self.qCount] = self.qHtry[self.tStepCount-1, 0:self.qCount] + self.delT * r1 + 0.5 * (self.delT**2) * r2

Plotting matrix size on the x-axis against percent speed-up on the y-axis gives something like a bell curve, peaking at about a 30% speed increase for 100x100 matrices. Smaller and larger matrices see less benefit, and at the extremes the serial code is actually faster. My guess is that the problem lies in the passing of the arguments: the overhead of copying the large matrix is taking longer than the job itself. What can I do to get around this? Is there some way to incorporate memory sharing and pass the matrix by reference? As you can see, none of the arguments are modified, so read-only access would be fine.

Thanks.

  • Wow, I thought freshman programming was supposed to be easier than in my day. I guess the fact that you get to use Python instead of C or a bizarre lisp dialect with only half of the standard library means they can make you write more interesting programs. :) Commented Jul 20, 2012 at 23:10
  • I am kind of in over my head, but I believe I was the only one who applied to this summer program with significant programming experience, so I was the only option, ideal or not. But, I am still trying my darndest. Commented Jul 21, 2012 at 0:53

1 Answer


Well, the point of ParallelPython is that you can write code that doesn't care whether it's distributed across threads, processes, or even multiple computers, and using memory sharing would break that abstraction.

One option is to use something like a file on a shared filesystem, where you mmap that file in each worker. Of course that's more complicated, and whether it's better or worse will depend on a lot of details about the filesystem, sharing protocol, and network, but it's an option.
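A minimal sketch of that idea using `numpy.memmap`, assuming the path below sits on a filesystem every worker can see (the filename and matrix shape are hypothetical examples):

    import os
    import tempfile
    import numpy as np

    # Hypothetical path; in practice this would live on the shared filesystem.
    path = os.path.join(tempfile.gettempdir(), "shared_matrix.dat")

    shape = (1000, 1000)
    arr = np.random.rand(*shape)   # the matrix to share
    arr.tofile(path)               # write it to disk once, up front

    # Each worker mmaps the file read-only instead of receiving a pickled copy;
    # pages are faulted in on demand, so nothing is copied eagerly.
    view = np.memmap(path, dtype=np.float64, mode="r", shape=shape)
    result = view[0:100, 0:100].dot(np.ones(100))

Since the workers only read, `mode="r"` is enough; no locking is needed.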

If you're willing to give up the option of distributed processing, you can use multiprocessing.Array (or multiprocessing.Value, or multiprocessing.sharedctypes) to access shared memory. But at that point, you might want to consider just using multiprocessing instead of ParallelPython for the job distribution: multiprocessing is part of the standard library and has a more powerful API, and you're explicitly giving up the one major advantage of ParallelPython.
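A sketch of the shared-memory route, wrapping a lock-free `multiprocessing.Array` in a NumPy view so `.dot()` can work on it directly. This assumes a POSIX system where the fork start method is available (the child inherits the shared buffer without any pickling):

    import multiprocessing as mp
    import numpy as np

    def make_shared_matrix(rows, cols):
        # Allocate a shared buffer (lock-free: access here is read-mostly)
        # and wrap it in a NumPy array without copying.
        buf = mp.Array('d', rows * cols, lock=False)
        return np.frombuffer(buf, dtype=np.float64).reshape(rows, cols)

    def worker(shared):
        # A write here is visible to the parent: the memory really is shared.
        shared[0, 0] = 42.0

    if __name__ == "__main__":
        mp.set_start_method("fork")  # POSIX-only; child inherits the view
        m = make_shared_matrix(4, 3)
        p = mp.Process(target=worker, args=(m,))
        p.start()
        p.join()
        print(m[0, 0])  # 42.0: the child's write landed in shared memory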

Or you could combine the two options, for the worst of both worlds in many ways, but maybe the best in terms of how little you need to change your existing code: Just use a local file and mmap it.

However, before you do any of this, you may want to consider profiling to see if copying the matrix really is the bottleneck. And, if it is, you may want to consider whether there's an algorithmic fix, just copying the part each job needs instead of copying the entire matrix. (Whether that makes sense depends on whether the part each job needs is significantly less than the whole thing, of course.)
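For the profiling step, something like the standard-library `cProfile`/`pstats` pair will show whether pickling and transfer dominate. The `step` function below is just a stand-in for the matrix operation in the question:

    import cProfile
    import pstats
    import numpy as np

    def step(matrix, vec):
        # Stand-in for the per-timestep matrix operation in the question.
        return matrix.dot(vec)

    matrix = np.random.rand(200, 200)
    vec = np.random.rand(200)

    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(1000):
        step(matrix, vec)
    profiler.disable()

    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 by cumulative time

If entries like `cPickle.dumps` (or `pickle` on Python 3) sit near the top when you profile the real parallel run, the copying is indeed the bottleneck.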


7 Comments

To be honest, I am not entirely sure if the copying is the slowdown but when profiling, {method 'acquire' of 'thread.lock' objects} and {cPickle.dumps} take 66% of the time to run the entire script. I assumed these were the pickling and then the copying. The end goal of my summer project is to make use of a cluster of machines, so the multi-processing module won't work. As for copying only certain parts, I could try to rewrite the matrix operations to work only with small blocks and then recombine them. Is that what you meant?
Well, the lock may not be related to the copying, but the cPickle.dumps almost certainly is. You might be able to save time by manually pickling the matrix before spawning children, then sharing the pickle (and manually unpickling in the children). Or come up with your own quicker and/or more compact representation than generic pickling, even. But first… are you sure that's 66% of the entire time, or just 66% of the time in the parent process itself (which may be doing very little real work)? Finally, yes, rewriting the operations to work on small blocks is what I meant, if it's plausible.
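The "pickle once, share the blob" idea can be sketched as below. (`pickle` here stands in for Python 2's `cPickle`; in the real code the blob would be handed to each submitted job rather than called in-process.)

    import pickle
    import numpy as np

    matrix = np.random.rand(500, 500)

    # Pickle the matrix once, up front...
    blob = pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL)

    def job(blob, rows, cols):
        # ...and unpickle it manually inside each job.
        m = pickle.loads(blob)
        return m[0:rows, 0:cols].sum()

    # The same blob is reused for every job instead of re-pickling per submit.
    r1 = job(blob, 10, 10)
    r2 = job(blob, 20, 20)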
Oh, and if the end goal is to use a cluster, unless it has some form of clustered memory sharing (which is accessible from Python), you obviously won't be able to use shared memory in the final version, so I wouldn't put effort into using it for this preliminary version. The shared filesystem plus mmap may be worth it, but may be hard to performance test without actually using it on the cluster.
They definitely take 66% of the entire time. The method that contains my originally posted code takes 522 of the 525 seconds that the script runs. That method is in a for loop that operates on a constantly changing matrix. The only problem with rewriting the matrix operations is that, for example, in a multiplication, each row is multiplied by each column in the other matrix. I don't see a way in which to break up the data (cutting down transfer time) without sending copies of the same values to different processes and thereby using unnecessary amounts of memory.
Well, no more unnecessary memory than copying the entire thing to each process… but it sounds like you're saying there's no obvious way to break down the data and compose everything at the end. So, let's assume you have to copy the whole thing to each process, possibly on a separate computer. So the suggestions about manually managing the pickling, doing better than cPickle, or mmap'ing shared files seem like the only plausible ways forward. (Can you rely on the entire cluster having access to, say, an NFS or SMB share? Do you have access to a cluster to play with now?)
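For the matrix-vector product in the original question (as opposed to a full matrix-matrix multiply), the row-block decomposition discussed above does work without duplicating data: each job needs only its block of rows plus the (small) vector, and the partial results concatenate back into the full product. A serial sketch, where each list element would become one job:

    import numpy as np

    def row_block_dot(matrix, vec, n_blocks):
        # Each job would receive only its row block plus the vector;
        # the partial products concatenate into the full result.
        blocks = np.array_split(matrix, n_blocks, axis=0)
        partials = [block.dot(vec) for block in blocks]  # one per job
        return np.concatenate(partials)

    matrix = np.random.rand(100, 80)
    vec = np.random.rand(80)
    result = row_block_dot(matrix, vec, 4)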
