Pandas: Create array with group indices

Question

I'm looking for an efficient, all-Pandas way of creating an array with group numbers (for every row in the original dataframe I want a number that tells me which group this row belongs to):

df = pandas.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 2, 1, 1, 2, 1]})
groups = df.groupby(['a', 'b'])
group_names = sorted(groups.groups.keys())
group_indices = np.array(df.index)
for index, group in enumerate(group_names):
    group_indices[groups.indices[group]] = index

where

In : df 
Out]:
   a  b
0  1  1
1  1  2
2  1  1
3  2  1
4  2  2
5  2  1    

In : groups.indices
Out:
{(1, 1): array([0, 2]),
 (1, 2): array([1]),
 (2, 1): array([3, 5]),
 (2, 2): array([4])}

In : group_indices
Out: array([0, 1, 0, 2, 3, 2])

My problem is that if df is around 20000x100 (64 bit floats) and I group by two of the columns, I get memory usage above 6 GB. Which is way more than I'd expect.

Jeff · Accepted Answer · 2013-07-02 00:40:02Z

5

The indicies are already embedded in the groupby object

In [52]: groups.grouper.levels
Out[52]: [Int64Index([1, 2], dtype=int64), Int64Index([1, 2], dtype=int64)]

In [53]: groups.grouper.labels
Out[53]: [array([0, 0, 0, 1, 1, 1]), array([0, 1, 0, 0, 1, 0])]

In [57]: l = groups.grouper.labels

In [58]: zip(*l)
Out[58]: [(0, 0), (0, 1), (0, 0), (1, 0), (1, 1), (1, 0)]

In [18]: groups.grouper.group_info
Out[18]: (array([0, 1, 0, 2, 3, 2]), array([0, 1, 2, 3]), 4)

Simple lookup operations as these are already computed on the grouping object

In [19]: groups.grouper.group_info[0]
Out[19]: array([0, 1, 0, 2, 3, 2])

edited Jul 2, 2013 at 0:40

answered Jul 1, 2013 at 14:07

Jeff

130k21 gold badges223 silver badges189 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

gc. Over a year ago

This also works quite nicely, and it certainly is cleaner though I think it's more magic since I have real trouble finding documentation on the group objects. I could also find the correct labels using groups.grouper.result_index.tolist(). Thanks!

Jeff Over a year ago

you never explained why you are doing with the info, you normally would have no need for this as the groupby takes care of the bookkeeping in its operations. what are you trying to do?

dashesy Over a year ago

thanks for mentioning grouper, as it is not documented! I finally found a solution to change rows in original dataframe while iterating over its grouped object, using grouped.grouper.indices. I had to use it because I have duplicate DateTime indices in the dataframe. Also the transformation is too complicated to fit a grouped then apply paradigm, it involves clustering and filling in multiple dataframes at once while going through each group.

Collectives™ on Stack Overflow

Pandas: Create array with group indices

1 Answer 1

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related