First I'll show a nice solution using structured arrays. The linked documentation has lots of good information on various ways to index, sort, and create them.
Let's define a subset of your data,
import numpy as np
X = np.array( [[3.4,9.13], [3.5,3.43], [3.6,2.01], [3.7,6.11],
[3.8,4.95], [3.9,7.02], [4.0,4.41]] )
I = np.array( [0,1,2,0,1,2,3], dtype=np.int32 )
Structured Array
If we make a structured array (i.e. an array of structs) from this data, the problem is trivial,
sa = np.zeros( len(X), dtype=[('I',np.int64),('X',np.float64,(2,))] )
Here we've made a zero-filled structured array. Each element of the array holds a 64 bit integer and a 2 element array of 64 bit floats. The list passed to dtype defines the struct, with each tuple representing a component of the struct. The tuples contain a label, a type, and a shape. The shape part is optional and defaults to a scalar entry.
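As a quick check of how such a dtype is laid out, here is a minimal sketch that inspects the field names, the per-field shapes, and the size of one record. The name demo and the length 3 are placeholders chosen for illustration, not part of your data.

```python
import numpy as np

# Same struct layout as above: an int64 label plus a 2-element float64 row.
dt = np.dtype([('I', np.int64), ('X', np.float64, (2,))])

# 'demo' is a hypothetical name; 3 records is an arbitrary length.
demo = np.zeros(3, dtype=dt)

print(demo.dtype.names)   # ('I', 'X')
print(demo['I'].shape)    # (3,)   one scalar label per record
print(demo['X'].shape)    # (3, 2) one 2-element row per record
print(demo.itemsize)      # 24 bytes per record: 8 for 'I' + 16 for 'X'
```

Indexing by field name returns an ordinary array view of that field, which is why the boolean indexing further down works on sa['X'] directly.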
Next we fill the structured array with your data,
sa['I'] = I
sa['X'] = X
At this point you can access the records like so,
>>> sa['X'][sa['I']==2]
array([[ 3.6 , 2.01],
       [ 3.9 , 7.02]])
Here we've asked for all the 'X' records and indexed them using the bool array created by the statement sa['I']==2. The dictionary you want can then be constructed using a comprehension,
d = { i:sa['X'][sa['I']==i] for i in np.unique(sa['I']) }
Next are two solutions using standard numpy arrays. The first uses np.where and leaves the arrays unmodified; the second sorts the arrays first, which should be faster for large I.
Using np.where
The use of np.where is not strictly necessary as arrays can be indexed using the bool array produced from I==I0 below, but having the actual indices as ints is useful in some circumstances.
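def indexby1( X,I,I0 ):
    # integer row indices where the label equals I0
    indx = np.where( I==I0 )
    sub = X[indx[0],:]
    return sub

def indexby2( X,I ):
    # one entry per label from 0 to I.max(); empty if a label is absent
    d = {}
    I0max = I.max()
    for I0 in range(I0max+1):
        d[I0] = indexby1( X, I, I0 )
    return d

d = indexby2( X, I )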
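To illustrate the point that np.where is optional here, this small sketch selects the same rows once with the boolean mask and once with the integer indices. The four-row arrays are a cut-down version of the sample data, and the names mask, rows, by_mask and by_rows are just for illustration.

```python
import numpy as np

X = np.array([[3.4, 9.13], [3.5, 3.43], [3.6, 2.01], [3.7, 6.11]])
I = np.array([0, 1, 2, 0])

mask = (I == 0)            # boolean array, one entry per row of X
rows = np.where(mask)[0]   # integer indices of the True entries

by_mask = X[mask]          # boolean indexing
by_rows = X[rows]          # integer (fancy) indexing

print(rows)                              # [0 3]
print(np.array_equal(by_mask, by_rows))  # True
```

Both forms return a copy of the selected rows, so the choice mostly comes down to whether you need the integer positions for something else.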
Sorting and pulling out chunks
Alternatively you can use the sorting solution mentioned and just return chunks,
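def order_arrays( X, I ):
    # sort both arrays by label; fancy indexing returns sorted copies
    indx = I.argsort()
    I = I[indx]
    X = X[indx]  # equivalent to X = X[indx,:]
    return X, I

def indexby(X, I, I0=None):
    # X and I must already be sorted by label (see order_arrays)
    if I0 is None:
        d = {}
        for I0 in range(I.max()+1):
            d[I0] = indexby( X, I, I0 )
        return d
    else:
        # rows with label I0 form one contiguous block of the sorted arrays
        ii = I.searchsorted(I0)
        ff = I.searchsorted(I0+1)
        sub = X[ii:ff]
        return sub

X,I = order_arrays( X, I )
d = indexby( X, I )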
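Here I've combined the two previous functions into one recursive function, matching the signature you described in your question. Note that the call to order_arrays rebinds X and I to sorted copies, so the original row order is lost unless you keep a reference to the unsorted arrays.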
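If you would rather keep the caller's X and I in their original order, one possible variation is to sort inside the function and look up all block boundaries with a single pair of searchsorted calls. This is only a sketch: group_sorted is a hypothetical name, and it assumes I is a non-empty array of non-negative integers. The stable sort keeps rows within each group in their original relative order.

```python
import numpy as np

def group_sorted(X, I):
    """Return {label: rows of X with that label} without changing X or I."""
    order = np.argsort(I, kind='stable')   # stable: preserves order within a label
    Xs, Is = X[order], I[order]            # sorted copies, local to this function
    labels = np.arange(Is.max() + 1)       # every label from 0 to I.max()
    starts = np.searchsorted(Is, labels, side='left')    # first row of each block
    stops = np.searchsorted(Is, labels, side='right')    # one past the last row
    return {int(k): Xs[a:b] for k, a, b in zip(labels, starts, stops)}

X = np.array([[3.4, 9.13], [3.5, 3.43], [3.6, 2.01], [3.7, 6.11],
              [3.8, 4.95], [3.9, 7.02], [4.0, 4.41]])
I = np.array([0, 1, 2, 0, 1, 2, 3], dtype=np.int32)

groups = group_sorted(X, I)
print(groups[2])   # rows [3.6, 2.01] and [3.9, 7.02]
print(I)           # [0 1 2 0 1 2 3], the caller's array is untouched
```

As in the recursive version, a label that never occurs in I maps to an empty slice rather than being left out of the dictionary.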
np.where is probably a good place to start.
np.searchsorted is another nice one.
pandas (especially if you've worked with SAS/R). It's based on numpy.