
OK, after some searching I can't find an SO question that directly tackles this. I've looked into masked arrays, and although they seem cool, I'm not sure they are what I need.

Consider two NumPy arrays:

zone_data is a 2-d NumPy array with clumps of elements that share the same value. These are my 'zones'.

value_data is a 2-d NumPy array (same shape as zone_data) with arbitrary values.

I want a NumPy array of the same shape as zone_data/value_data in which each zone number is replaced by the average of the values in that zone.

An example, in ASCII-art form:

zone_data (4 distinct zones):

1, 1, 2, 2
1, 1, 2, 2
3, 3, 4, 4
3, 3, 4, 4

value_data:

1, 2, 3, 6
3, 0, 2, 5
1, 1, 1, 0
2, 4, 2, 1

The result I want, call it result_data:

1.5, 1.5, 4.0, 4.0
1.5, 1.5, 4.0, 4.0
2.0, 2.0, 1.0, 1.0
2.0, 2.0, 1.0, 1.0

Here's the code I have. It gives the correct result.

result_data = np.zeros(zone_data.shape)
for i in np.unique(zone_data):
    result_data[zone_data == i] = np.mean(value_data[zone_data == i])
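As a sanity check, the snippet above can be run on the toy arrays. This is only a minimal sketch: the array literals are the ASCII art typed in by hand, and it assumes the zones are the four 2x2 blocks that the expected result implies (last row of zone_data being 3, 3, 4, 4).

```python
import numpy as np

# Toy arrays typed in from the ASCII art above (assumed: zones are 2x2 blocks).
zone_data = np.array([[1, 1, 2, 2],
                      [1, 1, 2, 2],
                      [3, 3, 4, 4],
                      [3, 3, 4, 4]])
value_data = np.array([[1, 2, 3, 6],
                       [3, 0, 2, 5],
                       [1, 1, 1, 0],
                       [2, 4, 2, 1]], dtype=float)

# Same loop as above: two boolean comparisons and one mean per distinct zone.
result_data = np.zeros(zone_data.shape)
for i in np.unique(zone_data):
    result_data[zone_data == i] = np.mean(value_data[zone_data == i])

print(result_data)
```

Each pass of the loop scans the whole array, so the cost grows with (number of zones) x (array size), which is why it gets slow on big inputs.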

My arrays are big, and this snippet takes several seconds. I think I have a knowledge gap and haven't hit on anything helpful. The loop needs to be delegated to a library or something...aarg!

I seek help to make this FASTER! Python gods, I seek your wisdom!

EDIT -- adding benchmark script

import numpy as np
import time

# 2000x1000 grid: integer zone labels in 0..999 and random values in [0, 1)
zones = np.random.randint(1000, size=(2000, 1000))
values = np.random.rand(2000, 1000)

print('start method 1:')
start_time = time.time()

result_data = np.zeros(zones.shape)
for i in np.unique(zones):
    result_data[zones == i] = np.mean(values[zones == i])

print('done method 1 in %.2f seconds' % (time.time() - start_time))

print()
print('start method 2:')
start_time = time.time()

# your method here!

print('done method 2 in %.2f seconds' % (time.time() - start_time))

my output:

start method 1:
done method 1 in 4.34 seconds

start method 2:
done method 2 in 0.00 seconds
2 Answers

You could use np.bincount:

count = np.bincount(zones.flat)
tot = np.bincount(zones.flat, weights=values.flat)
avg = tot/count
result_data2 = avg[zones]

which gives me

start method 1:
done method 1 in 3.13 seconds

start method 2:
done method 2 in 0.01 seconds
>>> np.allclose(result_data, result_data2)
True
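
To see what each of those lines does, here is a small sketch on a 4x4 toy case with four 2x2 zones labelled 1 to 4. The toy arrays and the `where=` guard are additions for illustration and are not part of the timing above. The guard matters because np.bincount returns one bin for every integer from 0 to the largest label, so a label that never occurs (0 here) has a count of zero and a plain division would give nan for that bin.

```python
import numpy as np

# Toy inputs (illustrative): four 2x2 zones labelled 1..4.
zones = np.array([[1, 1, 2, 2],
                  [1, 1, 2, 2],
                  [3, 3, 4, 4],
                  [3, 3, 4, 4]])
values = np.array([[1, 2, 3, 6],
                   [3, 0, 2, 5],
                   [1, 1, 1, 0],
                   [2, 4, 2, 1]], dtype=float)

flat_zones = zones.ravel()

# count[k] = number of elements labelled k, for k = 0..max label
count = np.bincount(flat_zones)
# tot[k] = sum of the values whose label is k
tot = np.bincount(flat_zones, weights=values.ravel())

# Divide only where a label actually occurs; unused labels keep an average of 0.
avg = np.zeros_like(tot)
np.divide(tot, count, out=avg, where=count > 0)

# Fancy indexing with the label array broadcasts each zone's mean back to its cells.
result_data2 = avg[zones]
print(result_data2)
```

Both bincount calls are a single pass over the data, so the cost no longer depends on the number of zones. The approach does need the labels to be non-negative integers, and it allocates one bin per integer up to the largest label.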

2 Comments

Excellent use of bincount. +1
DSM, that's awesome! I love SO mostly because of people like yourself who can share some specific knowledge that would have taken me a long time to find myself. Thank you so much! This was not just a trivial exercise...this will open one of the bottlenecks I have in an application. Love the "np.allclose" too...what a gem.
1

I thought I had seen this in scipy somewhere, but I can't find it anymore. Have you looked there?

Anyway, you can get a first improvement by changing your loop:

result_data = np.empty(zones.shape)  # minor speed gain: every element is overwritten below
for label in np.unique(zones):
    mask = zones == label  # build the boolean mask once and reuse it
    result_data[mask] = np.mean(values[mask])

That way you don't needlessly do the boolean comparison twice. That'll cut down the execution time a bit.
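
If the loop is being kept only because the zone labels are not small non-negative integers (which is what np.bincount needs), np.unique can do the relabelling itself via return_inverse. The following is a minimal sketch under that assumption; the negative and widely spaced labels and the example values are made up for illustration.

```python
import numpy as np

# Made-up example: negative and widely spaced labels cannot be used
# directly as bincount bins.
zones = np.array([[-5, -5, 70000],
                  [-5, 12, 70000]])
values = np.array([[1.0, 2.0, 10.0],
                   [3.0, 7.0, 20.0]])

# inverse[i] is the position of element i's label in the sorted unique labels,
# i.e. a compact relabelling 0..n_labels-1.
labels, inverse = np.unique(zones.ravel(), return_inverse=True)
inverse = inverse.ravel()

counts = np.bincount(inverse)
sums = np.bincount(inverse, weights=values.ravel())
means = sums / counts  # every compact label occurs at least once, so no zero counts

# Map each element's compact label back to its zone mean and restore the 2-d shape.
result_data = means[inverse].reshape(zones.shape)
print(result_data)
```

The np.unique call sorts the data, so this is slower than calling bincount on ready-made integer labels, but it still avoids one full-array comparison per zone.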

1 Comment

That's a good observation. In my case it would save about 40%, which is great, and I should have known better...I've done that in many other spots. I will take DSM's answer, however, as it's 100+ times faster!
