
I am looking for a fast method to determine the cross-matching indices of two arrays, defined as follows.

I have two very large (>1e7 elements) structured arrays, one called members, and another called groups. Both arrays have a groupID column. The groupID entries of the groups array are unique, the groupID entries of the members array are not.

The groups array has a column called mass. The members array has a (currently empty) column called groupmass. I want to assign the correct groupmass to those elements of members with a groupID that matches one of the groups. This would be accomplished via:

members['groupmass'][idx_matched_members] = groups['mass'][idx_matched_groups]

So what I need is a fast routine to compute the two index arrays idx_matched_members and idx_matched_groups. This sort of task is so common that it seems very likely a package like numpy or pandas would have an optimized solution. Does anyone know of a solution, professionally developed, homebrewed, or otherwise?
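To make the goal concrete, here is a toy illustration (with made-up values) of what the two index arrays should satisfy:

import numpy as np

members_gid = np.array([5, 9, 5, 2])   # members['groupID'] (duplicates allowed)
groups_gid = np.array([2, 5, 7])       # groups['groupID'] (unique)
# desired output:
# idx_matched_members = [0, 2, 3]  (members rows whose groupID occurs in groups)
# idx_matched_groups  = [1, 1, 0]  (the matching groups row for each of those members)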

2 Answers


This can be done with pandas, using map to pull values from one column based on the values of another. Here's an example with sample data:

import numpy as np
import pandas

members = pandas.DataFrame({
    'id': np.arange(10),
    'groupID': np.arange(10) % 3,
    'groupmass': np.zeros(10)
})

groups = pandas.DataFrame({
    'groupID': np.arange(3),
    'mass': np.random.randint(1, 10, 3)
})

This gives you the following data:

>>> members
   groupID  groupmass  id
0        0          0   0
1        1          0   1
2        2          0   2
3        0          0   3
4        1          0   4
5        2          0   5
6        0          0   6
7        1          0   7
8        2          0   8
9        0          0   9
>>> groups
   groupID  mass
0        0     3
1        1     7
2        2     4

Then:

>>> members['groupmass'] = members.groupID.map(groups.set_index('groupID').mass)
>>> members
   groupID  groupmass  id
0        0          3   0
1        1          7   1
2        2          4   2
3        0          3   3
4        1          7   4
5        2          4   5
6        0          3   6
7        1          7   7
8        2          4   8
9        0          3   9

If you will often want to use the groupID as the index into groups, you can set it that way permanently so you won't have to use set_index every time you do this.
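For example, a minimal sketch of that permanent setup (assuming you no longer need groupID as an ordinary column of groups):

groups = groups.set_index('groupID')
# now groups.mass is a Series indexed by groupID, so map works directly
members['groupmass'] = members.groupID.map(groups.mass)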


1 Comment

that is exactly the functionality I was looking for. I just played around with some fake data of very large size, and the map routine is lightning fast and quite robust. Pandas is so damn awesome. Thanks!

Here's an example of setting the mass with just numpy. It does use iteration, so for large arrays it won't be fast.

For just 10 rows this is much faster than the pandas equivalent, but as the data set grows (e.g. M=10000), pandas is much better. The setup time for pandas is larger, but its per-row iteration time is much lower.

Generate test arrays:

import numpy as np

dt_members = np.dtype({'names': ['groupID', 'groupmass'], 'formats': [int, float]})
dt_groups = np.dtype({'names': ['groupID', 'mass'], 'formats': [int, float]})

N, M = 5, 10
members = np.zeros((M,), dtype=dt_members)
groups = np.zeros((N,), dtype=dt_groups)
members['groupID'] = np.random.randint(101, 101 + N, M)
groups['groupID'] = np.arange(101, 101 + N)
groups['mass'] = np.arange(1, N + 1)

def getgroup(id):
    # boolean mask selecting the (single) group with this groupID
    idx = id == groups['groupID']
    return groups[idx]

# take the scalar mass out of each length-1 match
members['groupmass'][:] = [getgroup(id)['mass'][0] for id in members['groupID']]

In Python 2 the iteration could use map directly, since map returns a list there:

members['groupmass'] = map(lambda x: getgroup(x)['mass'][0], members['groupID'])
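In Python 3, map returns a lazy iterator rather than a list, so a minimal adaptation is to materialize it first:

members['groupmass'][:] = list(map(lambda x: getgroup(x)['mass'][0], members['groupID']))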

I can improve the speed by about 2x by minimizing the repeated subscripting, e.g.:

def setmass(members, groups):
    gmass = groups['mass']
    gid = groups['groupID']
    # pull the scalar mass out of each length-1 boolean-mask match
    mass = [gmass[id == gid][0] for id in members['groupID']]
    members['groupmass'][:] = mass

But if groups['groupID'] can be mapped onto arange(N), then we can get a big jump in speed. By applying the same mapping to members['groupID'], it becomes a simple array indexing problem.

In my sample arrays, groups['groupID'] is just arange(N)+101. So the mapping just subtracts that minimum.

def setmass1(members, groups):
    members['groupmass'][:] = groups['mass'][members['groupID']-groups['groupID'].min()]

This is 300x faster than my earlier code, and 8x better than the pandas solution (for arrays with M=10000, N=500).
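A minimal timing harness to reproduce that comparison might look like this (a sketch; regenerate the test arrays above with N, M = 500, 10000 first, and expect the exact ratios to vary by machine):

import timeit

t_loop = timeit.timeit(lambda: setmass(members, groups), number=10)
t_index = timeit.timeit(lambda: setmass1(members, groups), number=10)
print(t_loop / t_index)  # ratio of per-call times; ~300x in the run quoted above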

I suspect pandas does something like this internally. groups.set_index('groupID').mass (in the pandas answer above) is just the mass Series with groupID as its index. (I could test this with a more general array.)

In a more general case, it might help to sort groups, and if necessary, fill in some indexing gaps.
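One way to handle that general case (a sketch, not from the original answer) is to sort groups by groupID and use np.searchsorted to locate each member's ID, masking out members whose ID is absent. This also yields exactly the two index arrays the question asked for:

def setmass_sorted(members, groups):
    # sort groups by groupID so searchsorted can locate each member's ID
    order = np.argsort(groups['groupID'])
    sorted_gid = groups['groupID'][order]
    pos = np.searchsorted(sorted_gid, members['groupID'])
    pos = np.clip(pos, 0, len(sorted_gid) - 1)
    # keep only members whose groupID actually occurs in groups
    matched = sorted_gid[pos] == members['groupID']
    idx_matched_members = np.nonzero(matched)[0]
    idx_matched_groups = order[pos[matched]]
    members['groupmass'][idx_matched_members] = groups['mass'][idx_matched_groups]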


Here's a 'vectorized' solution - no iteration. But it has to calculate a very large comparison matrix (length of groups by length of members), so it does not gain much speed (np.where is the slowest step).

def setmass2(members, groups):
    idx = np.where(members['groupID'] == groups['groupID'][:,None])
    members['groupmass'][idx[1]] = groups['mass'][idx[0]]

2 Comments

The creation of the mass array can be done far faster than this in pure numpy, @hpaulj, using numpy.repeat(groups['mass'], groups['richness']). This works like greased lightning even for arrays in excess of 1e7 entries. What I was looking for in my question was an analogously fast method to use pure numpy to solve the ID matching problem that I now have a rapid pandas solution for. I am still looking for this.
If groups['groupID'] can be mapped onto arange, this becomes a simple and fast indexing task.
