
I have three arrays:

import numpy as np
value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
data  = np.array([1, 2, 3, 4, 5, 6])

The arrays "index" and "value" have the same size, and I want to group the items in "value" by their key in "index" and take the average of each group. For example, the first two items of "value" (1 and 3) share the key 1 in "index", so the corresponding entry in the final array is the mean of the 1st and 2nd items of "value": (1 + 3) / 2 = 2.

The final array is:

[2, nan, 4, nan, nan, 5]

The first value is the average of the 1st and 2nd items of "value".
The second value is nan because the key 2 does not appear in "index".
The third value is the average of the 3rd and 4th items of "value", and so on.

Thanks for your help!!!

Regards, Roy

  • "[...]because there is not any key in index" - can you explain how the indices in the index array relate to the average values any better? Commented Jan 13, 2011 at 2:00
  • Oh sorry, maybe my explanation was not clear. The arrays "index" and "value" have the same size, and I want to group the items in "value" by taking the average. For example, the first two items of "value" (1 and 3) have the same key 1 in "index", so the value in the final array is the mean of the 1st and 2nd items of "value": (1 + 3) / 2 = 2. Commented Jan 13, 2011 at 2:08
  • Just edit your original posting. Comments are not really made for that. Commented Jan 13, 2011 at 2:13

4 Answers

>>> [value[index==i].mean() for i in data]
[2.0, nan, 4.0, nan, nan, 5.0]


Maybe you would like to use numpy.bincount()?

value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
np.bincount(index, value) / np.bincount(index)
# array([ NaN,   2.,  NaN,   4.,  NaN,  NaN,   5.])
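Note that the result above has one entry per integer from 0 to index.max(), so it starts with an extra entry for key 0. The following is a minimal sketch, assuming the keys are non-negative integers, of how the bincount result could be looked up for the keys in "data". The names size, sums, counts, means and result are illustrative and not part of the original answer.

```python
import numpy as np

value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
data = np.array([1, 2, 3, 4, 5, 6])

# Assumption: keys are non-negative integers, as np.bincount requires.
# minlength reserves a slot for every key in data, even keys above index.max().
size = int(max(index.max(), data.max())) + 1
sums = np.bincount(index, weights=value, minlength=size)
counts = np.bincount(index, minlength=size)

# Divide only where a key actually occurs; missing keys stay NaN.
means = np.full(size, np.nan)
np.divide(sums, counts, out=means, where=counts > 0)

# Pick out the mean for each key listed in data.
result = means[data]
print(result)
```

Restricting the division to non-zero counts avoids the divide-by-zero warning that the plain sums / counts expression emits for missing keys.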



Is this the general idea you are looking for?

import numpy as np
value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
data  = np.array([1, 2, 3, 4, 5, 6])

# Float array with one slot per key in data; every slot is overwritten below
answer = np.array(data, dtype=float)
for i, e in enumerate(data):
    idx = np.where(index == e)[0]  # positions where the key equals e
    val = value[idx]
    answer[i] = np.mean(val)       # mean of an empty selection is nan

print(answer)  # [ 2. nan  4. nan nan  5.]

If your data array is very large, there may be better solutions.

6 Comments

Yes, my data is actually very large :P, around 4320000 records. Sorry for the unclear question.
How big are value and index then?
Is a len(value) by len(data) 2D array too big to fit in memory?
For "value" and "index" the size is 4320000; "data" is smaller, 1124000. There is not enough memory to make that huge array.
Then I think I'd stick with the above solution. You could use an array mask instead of where to try to optimize, but I think you are stuck iterating with Python. If it is still too slow, you can try Cython.
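The mask variant mentioned in the last comment might look roughly like the sketch below. It is only an illustration of the idea, still loops in Python, and no speedup has been measured here. The names mask and key are illustrative.

```python
import numpy as np

value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
data = np.array([1, 2, 3, 4, 5, 6])

# Start from all-NaN so keys that never occur in index stay NaN.
answer = np.full(len(data), np.nan)
for i, key in enumerate(data):
    mask = index == key          # boolean mask instead of np.where
    if mask.any():               # skip empty groups to avoid a mean-of-empty warning
        answer[i] = value[mask].mean()

print(answer)
```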

I searched and found that numpy's histogram can be used to handle the huge arrays:

value = np.array([1, 3, 3, 5, 5, 7, 3], dtype='float')
index = np.array([1, 1, 3, 3, 6, 6, 6], dtype='float')
data = np.array([1, 2, 3, 4, 5, 6])

sums = np.histogram(index, bins=np.arange(index.min(), index.max()+2), weights=value)[0]
counter = np.histogram(index, bins=np.arange(index.min(), index.max()+2))[0]

sums / counter

array([ 2., NaN, 4., NaN, NaN, 5.])

