1

I have an array that contains numbers that are distances, and another that represents certain values at that distance. How do I calculate the average of all the data at a fixed value of the distance?

e.g. distances (d): [1 1 14 6 1 12 14 6 6 7 4 3 7 9 1 3 3 6 5 8]

e.g. data corresponding to each entry of the distances:

[3.3 2.1 3.5 2.5 4.6 7.4 2.6 7.8 9.2 10.11 14.3 2.5 6.7 3.4 7.5 8.5 9.7 4.3 2.8 4.1]

Therefore value=3.3 at d=1; value=2.1 at d=1; value=3.5 at d=14; etc.

For example, at distance d=6 I should take the mean of 2.5, 7.8, 9.2 and 4.3.

I've used the following code that works, but I do not know how to store the values into a new array:

from numpy import mean

# key holds the distances, dist holds the corresponding values
for d in set(key):
    print(d, mean([dist[i] for i in range(len(key)) if key[i] == d]))

Please help! Thanks

5 Answers

1

You've got the hard part done; putting your results into a new list is as easy as:

result = []
for d in set(key): 
    result.append(mean([dist[i] for i in range(len(key)) if key[i] == d]))
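Note that `set(key)` has no guaranteed order, so the positions in `result` do not say which distance each mean belongs to. A minimal sketch that keeps that link by storing the means in a dict, assuming `key` holds the distances and `dist` the values as in the question (the names `groups`, `means` and `per_element` are illustrative):

```python
from collections import defaultdict

key = [1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3, 6, 5, 8]
dist = [3.3, 2.1, 3.5, 2.5, 4.6, 7.4, 2.6, 7.8, 9.2, 10.11,
        14.3, 2.5, 6.7, 3.4, 7.5, 8.5, 9.7, 4.3, 2.8, 4.1]

# Collect the values observed at each distance in a single pass.
groups = defaultdict(list)
for k, v in zip(key, dist):
    groups[k].append(v)

# Map each distance to the mean of its values, ordered by distance.
means = {k: sum(vals) / len(vals) for k, vals in sorted(groups.items())}

# Mean for every original position, aligned with `key`.
per_element = [means[k] for k in key]
```

This also avoids rescanning the whole list once per distinct distance, which the nested comprehension does.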

1

Using pandas

import pandas as pd

g = pd.DataFrame({'d':d, 'k':k}).groupby('d')

Option 1: transform to get the values in the same positions

g.transform('mean').values

Option 2: take the mean directly and get a dict mapping each distance to its mean

g.mean().to_dict()['k']
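Both options can be put together in a self-contained sketch, assuming `d` holds the distances and `k` the values from the question. Selecting the `'k'` column before aggregating is a choice made here so that the results are one-dimensional; `means` and `per_element` are illustrative names:

```python
import pandas as pd

d = [1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3, 6, 5, 8]
k = [3.3, 2.1, 3.5, 2.5, 4.6, 7.4, 2.6, 7.8, 9.2, 10.11,
     14.3, 2.5, 6.7, 3.4, 7.5, 8.5, 9.7, 4.3, 2.8, 4.1]

# Group the values by distance and keep only the value column.
g = pd.DataFrame({'d': d, 'k': k}).groupby('d')['k']

# One mean per distinct distance, as a Series indexed by distance.
means = g.mean()

# The group mean repeated at each original position (same length as d).
per_element = g.transform('mean').to_numpy()

print(means.to_dict())
```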

0

Setup

import numpy as np

d = np.array(
  [1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3, 6, 5, 8]
)

k = np.array(
  [3.3,2.1,3.5,2.5,4.6,7.4,2.6,7.8,9.2,10.11,14.3,2.5,6.7,3.4,7.5,8.5,9.7,4.3,2.8,4.1]
)

scipy.sparse + csr_matrix

from scipy import sparse

s = d.shape[0]
r = np.arange(s+1)
m = d.max() + 1
b = np.bincount(d)

out = sparse.csr_matrix( (k, d, r), (s, m) ).sum(0).A1

(out / b)[d]

array([ 4.375,  4.375,  3.05 ,  5.95 ,  4.375,  7.4  ,  3.05 ,  5.95 ,
        5.95 ,  8.405, 14.3  ,  6.9  ,  8.405,  3.4  ,  4.375,  6.9  ,
        6.9  ,  5.95 ,  2.8  ,  4.1  ])

0

You could use NumPy's array together with where, also from NumPy.

You can define a function to get the positions of the desired distances:

from numpy import mean, array, where  

def key_distances(distances, d):
  return where(distances == d)[0]

Then use it to get the values at those positions.

Let's say you have:

d = array([1,1,14,6,1,12,14,6,6,7,4,3,7,9,1,3,3,6,5,8])
v = array([3.3,2.1,3.5,2.5,4.6,7.4,2.6,7.8,9.2,10.11,14.3,2.5,6.7,3.4,7.5,8.5,9.7,4.3,2.8,4.1])

Then you might do something like:

vs = v[key_distances(d,d[1])]

Then get your mean:

print(mean(vs))
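To cover every distance rather than a single one, the same `where` lookup can be applied to each distinct value. A sketch assuming `d` and `v` as defined above; `unique` is an additional NumPy function not used in the original answer, and `distances` and `means` are illustrative names:

```python
from numpy import array, mean, unique, where

d = array([1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3, 6, 5, 8])
v = array([3.3, 2.1, 3.5, 2.5, 4.6, 7.4, 2.6, 7.8, 9.2, 10.11,
           14.3, 2.5, 6.7, 3.4, 7.5, 8.5, 9.7, 4.3, 2.8, 4.1])

def key_distances(distances, d):
    # Indices at which `distances` equals the requested distance d.
    return where(distances == d)[0]

# Sorted distinct distances, and the mean of v at each of them.
distances = unique(d)
means = array([mean(v[key_distances(d, x)]) for x in distances])
```

`distances[i]` and `means[i]` then belong together, which gives the stored array the question asks for.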

0

The numpy_indexed package (disclaimer: I am its author) was designed with these use-cases in mind:

import numpy_indexed as npi
npi.group_by(d).mean(dist)

Pandas can do similar things, but its API isn't really tailored to them, and for an operation as elementary as a group-by I feel it's kind of wrong to have to hoist your data into a completely new data structure.
