
Given two large arrays of 3D points (I'll call the first "source" and the second "destination"), I needed a function that returns, for each element of "source", the index of its closest element in "destination", with this limitation: I can only use numpy... So no scipy, pandas, numexpr, cython...

To do this I wrote a function based on the "brute force" answer to this question. I iterate over the elements of source, find the closest element in destination, and return its index. Due to performance concerns, and again because I can only use numpy, I tried multithreading to speed it up. Here are both the threaded and unthreaded functions and how they compare in speed on an 8-core machine.

import timeit
import numpy as np
from numpy.core.umath_tests import inner1d
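# NOTE: inner1d comes from a private numpy module that was removed in later
# releases; np.einsum('ij,ij->i', a, b) computes the same row-wise dot product.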
from multiprocessing.pool import ThreadPool

def threaded(sources, destinations):
    # Define worker function
    def worker(point):
        dlt = (destinations-point) # delta between destinations and given point
        d = inner1d(dlt,dlt) # get distances
        return np.argmin(d) # return closest index

    # Multithread!
    p = ThreadPool()
    return p.map(worker, sources)


def unthreaded(sources, destinations):
    results = []
    for point in sources:
        dlt = (destinations - point) # delta between destinations and given point
        d = inner1d(dlt, dlt) # get distances
        results.append(np.argmin(d)) # append closest index

    return results


# Setup the data
n_destinations = 10000 # 10k random destinations
n_sources = 10000      # 10k random sources
destinations = np.random.rand(n_destinations, 3) * 100
sources = np.random.rand(n_sources, 3) * 100

# Compare!
print('threaded:   %s' % timeit.Timer(lambda: threaded(sources, destinations)).repeat(1, 1)[0])
print('unthreaded: %s' % timeit.Timer(lambda: unthreaded(sources, destinations)).repeat(1, 1)[0])

Results:

threaded:   0.894030461056
unthreaded: 1.97295164054

Multithreading seems beneficial, but I was hoping for more than a 2× speedup, given that the real-life datasets I deal with are much larger.

All recommendations to improve performance (within the limitations described above) will be greatly appreciated!

  • Before trying to add horsepower by means of multithreading, I would start with a better algorithm... Currently your brute-force approach is O(N*M), while, after preparing a k-d tree of the target points, you could make it O(N*log(M)). Commented Aug 5, 2015 at 23:37
  • Why are you committed to using only numpy? Commented Aug 6, 2015 at 0:32
  • @ali_m I have to use a custom build of Python called mayapy, Autodesk's implementation of Python inside their 3D software (Maya). It's compiled against VC2010 64-bit, so pre-built binaries such as Christoph Gohlke's and others aren't compatible. I was able to build numpy from scratch but failed to build scipy, numexpr, etc. Commented Aug 6, 2015 at 4:45
  • @MatteoItalia I did look for pure-Python KD tree implementations; notably I found this one and this one. Both were vastly slower (like 10x) than my unthreaded numpy test. So I'm afraid that without access to scipy (as stated above), KD trees are a dead end. Commented Aug 6, 2015 at 5:54
  • Just for reference, on my puny dual-core laptop cKDTree takes about 17 ms to both construct the tree and query it using your example data, compared with 1.53 s for threaded and 1.39 s for unthreaded. I am absolutely convinced that k-d trees are worth pursuing; a minimal sketch follows below. I'm very surprised that you saw such slow query times for pykdtree, which seems to be implemented in C/Cython. Commented Aug 6, 2015 at 20:14
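
For reference, a minimal sketch of the cKDTree approach mentioned in the comments (it assumes scipy is available, which the question's constraints rule out, so it serves only as a baseline for comparison):

import numpy as np
from scipy.spatial import cKDTree

destinations = np.random.rand(10000, 3) * 100
sources = np.random.rand(10000, 3) * 100

# Build a k-d tree over the destinations once (O(M log M)),
# then query the nearest destination for every source (O(N log M) total).
tree = cKDTree(destinations)
distances, indices = tree.query(sources)  # indices into destinations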

1 Answer


OK, I've been reading the Maya documentation on Python, and I came to these conclusions/guesses:

  • They're probably running CPython inside (there are several references to CPython's documentation and not to any other implementation).
  • They're not fond of threads (lots of non-thread-safe methods).

Given the above, I'd say it's better to avoid threads. Because of the GIL, CPU-bound threading is a common problem in CPython, and there are several ways around it:

  • Build a C/C++ extension and do the threading in C/C++, where the GIL doesn't apply. Personally, I'd give SIP a try first, and move on if it doesn't work out.
  • Use multiprocessing. Even if your custom Python distribution doesn't include it, you can get a working version since it's all pure Python code. multiprocessing is not affected by the GIL since it spawns separate processes; a minimal sketch follows after this list.
  • If neither of the above works out for you, try another parallelization tool (after some serious praying).
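
A minimal sketch of the multiprocessing route (the brute-force search mirrors the question's code, with np.einsum standing in for inner1d; splitting the sources into chunks is an assumption to keep per-task overhead low):

import numpy as np
from multiprocessing import Pool

def closest_indices(chunk, destinations):
    # Brute-force nearest neighbour for one chunk of source points.
    results = []
    for point in chunk:
        dlt = destinations - point
        d = np.einsum('ij,ij->i', dlt, dlt)  # squared distances
        results.append(np.argmin(d))
    return results

if __name__ == '__main__':
    destinations = np.random.rand(10000, 3) * 100
    sources = np.random.rand(10000, 3) * 100
    # A few large chunks keep the pickling overhead small relative
    # to the work each process does.
    chunks = np.array_split(sources, 8)
    with Pool() as pool:
        parts = pool.starmap(closest_indices,
                             [(chunk, destinations) for chunk in chunks])
    indices = [i for part in parts for i in part]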

On a side note, if you're using outside modules, be mindful of matching Maya's Python version and compiler. That may be the reason you couldn't build scipy. Of course, scipy has a huge codebase, and Windows is not the most forgiving platform to build it on.


1 Comment

Yeah, it's very annoying. Because it seems there's no way to improve performance while staying purely in numpy, I may end up pickling the data out and having an external "standard" Python instance (with scipy and all its arsenal) crunch it and spit out the result; a rough sketch of that hand-off follows below. Not what I want, but at this point I need to move on.
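
For what it's worth, a rough sketch of that hand-off (the file names, paths, and external interpreter location are all placeholders):

import subprocess
import numpy as np

# Inside mayapy: dump the arrays, run a full Python install that has scipy,
# then read the result back.
np.save('sources.npy', sources)
np.save('destinations.npy', destinations)
subprocess.check_call(['C:/Python27/python.exe', 'crunch.py'])
indices = np.load('indices.npy')

# crunch.py (hypothetical, runs under the external Python with scipy):
#   import numpy as np
#   from scipy.spatial import cKDTree
#   src = np.load('sources.npy')
#   dst = np.load('destinations.npy')
#   np.save('indices.npy', cKDTree(dst).query(src)[1])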
