I'm looking for an efficient, all-Pandas way of creating an array with group numbers (for every row in the original dataframe I want a number that tells me which group this row belongs to):

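import numpy as np
import pandas

df = pandas.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 2, 1, 1, 2, 1]})
groups = df.groupby(['a', 'b'])
group_names = sorted(groups.groups.keys())
# Start from a copy of the index; every entry is overwritten below.
group_indices = np.array(df.index)
for index, group in enumerate(group_names):
    # groups.indices maps each group key to the row positions in that group.
    group_indices[groups.indices[group]] = index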

where

In : df 
Out:
   a  b
0  1  1
1  1  2
2  1  1
3  2  1
4  2  2
5  2  1    

In : groups.indices
Out:
{(1, 1): array([0, 2]),
 (1, 2): array([1]),
 (2, 1): array([3, 5]),
 (2, 2): array([4])}

In : group_indices
Out: array([0, 1, 0, 2, 3, 2])

My problem is that if df is around 20000x100 (64-bit floats) and I group by two of the columns, memory usage climbs above 6 GB, which is far more than I'd expect.

1 Answer

The indices are already embedded in the groupby object:

In [52]: groups.grouper.levels
Out[52]: [Int64Index([1, 2], dtype=int64), Int64Index([1, 2], dtype=int64)]

In [53]: groups.grouper.labels
Out[53]: [array([0, 0, 0, 1, 1, 1]), array([0, 1, 0, 0, 1, 0])]

In [57]: l = groups.grouper.labels

In [58]: zip(*l)
Out[58]: [(0, 0), (0, 1), (0, 0), (1, 0), (1, 1), (1, 0)]

In [18]: groups.grouper.group_info
Out[18]: (array([0, 1, 0, 2, 3, 2]), array([0, 1, 2, 3]), 4)

These are simple lookups, since the values are already computed on the grouping object:

In [19]: groups.grouper.group_info[0]
Out[19]: array([0, 1, 0, 2, 3, 2])
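The grouper attributes shown above are internal, and their names may differ between pandas versions. As a hedged alternative, assuming a reasonably recent pandas that provides the public `GroupBy.ngroup()` method, the same per-row group numbers can be sketched like this (the name `group_numbers` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 2, 1, 1, 2, 1]})

# ngroup() numbers each group in the order the groupby would iterate over
# them; with the default sort=True that is the sorted order of the keys.
group_numbers = df.groupby(['a', 'b']).ngroup().to_numpy()

print(group_numbers)
```

With the default sorting, the numbering should match the `group_indices` array built by the loop in the question.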

3 Comments

This also works quite nicely, and it certainly is cleaner though I think it's more magic since I have real trouble finding documentation on the group objects. I could also find the correct labels using groups.grouper.result_index.tolist(). Thanks!
You never explained what you are doing with the info. You normally have no need for this, as the groupby takes care of the bookkeeping in its operations. What are you trying to do?
Thanks for mentioning grouper, as it is not documented! I finally found a solution to change rows in the original dataframe while iterating over its grouped object, using grouped.grouper.indices. I had to use it because I have duplicate DateTime indices in the dataframe. Also, the transformation is too complicated to fit a group-then-apply paradigm: it involves clustering and filling in multiple dataframes at once while going through each group.
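The approach in the last comment might look roughly like the sketch below. It is only an illustration under assumptions: the column names `key` and `value`, the sample data, and the demeaning step are all hypothetical stand-ins for the commenter's actual transformation, and it uses the public `indices` attribute of the groupby object rather than `grouper.indices`. Because the positions are integer row offsets, `iloc` can update the original dataframe even when index labels repeat.

```python
import pandas as pd

# Hypothetical data with duplicate datetime index labels.
df = pd.DataFrame(
    {"key": ["x", "y", "x", "y"], "value": [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]
    ),
)

value_col = df.columns.get_loc("value")

# indices maps each group key to the integer row positions of that group.
for key, positions in df.groupby("key").indices.items():
    values = df.iloc[positions, value_col].to_numpy()
    # Stand-in transformation: subtract the group mean from each row.
    df.iloc[positions, value_col] = values - values.mean()

print(df)
```

Converting to a NumPy array before assigning avoids any label alignment, which would be ambiguous with duplicate index labels.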
