Here's an example of setting the mass with just numpy. It does use iteration, so for large arrays it won't be fast.
For just 10 rows, this is much faster than the pandas equivalent. But as the data set becomes larger (eg. M=10000), pandas is much better. The setup time for pandas is larger, but the per row iteration time much lower.
Generate test arrays:
dt_members = np.dtype({'names':['groupID','groupmass'], 'formats': [int, float]})
dt_groups = np.dtype({'names':['groupID', 'mass'], 'formats': [int, float]})
N, M = 5, 10
members = np.zeros((M,), dtype=dt_members)
groups = np.zeros((N,), dtype=dt_groups)
members['groupID'] = np.random.randint(101, 101+N, M)
groups['groupID'] = np.arange(101, 101+N)
groups['mass'] = np.arange(1,N+1)
def getgroup(id):
idx = id==groups['groupID']
return groups[idx]
members['groupmass'][:] = [getgroup(id)['mass'] for id in members['groupID']]
In python2 the iteration could use map:
members['groupmass'] = map(lambda x: getgroup(x)['mass'], members['groupID'])
I can improve the speed by about 2x by minimizing the repeated subscripting, eg.
def setmass(members, groups):
gmass = groups['mass']
gid = groups['groupID']
mass = [gmass[id==gid] for id in members['groupID']]
members['groupmass'][:] = mass
But if groups['groupID'] can be mapped onto arange(N), then we can get a big jump in speed. By applying the same mapping to members['groupID'], it becomes a simple array indexing problem.
In my sample arrays, groups['groupID'] is just arange(N)+101. So the mapping just subtracts that minimum.
def setmass1(members, groups):
members['groupmass'][:] = groups['mass'][members['groupID']-groups['groupID'].min()]
This is 300x faster than my earlier code, and 8x better than the pandas solution (for 10000,500 arrays).
I suspect pandas does something like this. pgroups.set_index('groupID').mass is the mass Series, with an added .index attribute. (I could test this with a more general array)
In a more general case, it might help to sort groups, and if necessary, fill in some indexing gaps.
Here's a 'vectorized' solution - no iteration. But it has to calculate a very large matrix (length of groups by length of members), so does not gain much speed (np.where is the slowest step).
def setmass2(members, groups):
idx = np.where(members['groupID'] == groups['groupID'][:,None])
members['groupmass'][idx[1]] = groups['mass'][idx[0]]