Calculate aggregate of NumPy array with pandas groupby

Question

I have a DataFrame with two columns: one is the group and the other holds vector embeddings. The data already comes in this form, so I'd rather not debate the layout of the embedding column. All embeddings have the same number of dimensions.

I want to compute the average embedding for each group. By average I mean the element-wise (per-axis) average, so [1,2] and [4,8] average to [2.5,5].

import pandas as pd
import numpy as np

df = pd.DataFrame({"group":["a","a","b","b"],"embedding":[[0,1],[1,0],[0,0],[1,1]]})
df['embedding'] = df['embedding'].apply(np.array)

df.groupby("group").agg({"embedding":"mean"})  # This raises an error

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in mean(self, numeric_only)
   1497             "mean",
   1498             alt=lambda x, axis: Series(x).mean(numeric_only=numeric_only),
-> 1499             numeric_only=numeric_only,
   1500         )
   1501 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
   1079 
   1080         if not output:
-> 1081             raise DataError("No numeric types to aggregate")
   1082 
   1083         return self._wrap_aggregated_output(output, index=self.grouper.result_index)

DataError: No numeric types to aggregate

Expected output:

pd.DataFrame({"group":["a","b"],"embedding":[[0.5,0.5],[0.5,0.5]]})

A fast solution would be much appreciated, since my data is quite large.

So the elements of that column are 2-element lists, and lists don't do math. [1,2]+[3,4] is not [4,6].
– hpaulj, Commented Jul 2, 2021 at 1:41
That's why I convert it to np.array using df['embedding'].apply(np.array).
– Vinson Ciawandy, Commented Jul 2, 2021 at 1:42
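
To illustrate the exchange in the comments above, here is a minimal sketch of the difference between list and array arithmetic, using the question's own sample data. It also shows that the column keeps object dtype after the conversion. My understanding is that the built-in "mean" aggregation only looks at numeric columns, which would explain why the DataError persists even though each cell is an array.

```python
import numpy as np
import pandas as pd

# Plain lists concatenate; NumPy arrays add element-wise.
list_sum = [1, 2] + [3, 4]
array_sum = np.array([1, 2]) + np.array([3, 4])
print(list_sum)   # [1, 2, 3, 4]
print(array_sum)  # [4 6]

df = pd.DataFrame({"group": ["a", "a", "b", "b"],
                   "embedding": [[0, 1], [1, 0], [0, 0], [1, 1]]})
df["embedding"] = df["embedding"].apply(np.array)

# Each cell is now an ndarray, but the column itself is still object dtype,
# so pandas does not treat it as a numeric column.
first_cell_type = type(df["embedding"].iloc[0])
column_dtype = df["embedding"].dtype
print(first_cell_type)
print(column_dtype)  # object
```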

akuiper · Accepted Answer · 2021-07-02 01:48:42Z

2

If the elements in the embedding column are guaranteed to be NumPy arrays of the same shape, you can use groupby + apply with the Series.mean method to calculate the element-wise average:

df.groupby('group').embedding.apply(lambda g: g.mean()).reset_index()
#  group   embedding
#0     a  [0.5, 0.5]
#1     b  [0.5, 0.5]
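
Since the question mentions a large dataset, another option worth trying is to stack all embeddings into one 2-D numeric array and let pandas aggregate plain numeric columns. This is only a sketch under the same assumption that every embedding has the same length. The names matrix, wide and means are made up for illustration, and I have not benchmarked it against the one-liner above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "b"],
                   "embedding": [[0, 1], [1, 0], [0, 0], [1, 1]]})
df["embedding"] = df["embedding"].apply(np.array)

# Stack the per-row arrays into one (n_rows, n_dims) numeric matrix.
matrix = np.vstack(df["embedding"].tolist())

# One numeric column per embedding dimension, indexed by the group label.
wide = pd.DataFrame(matrix, index=df["group"])

# Numeric columns can use the built-in groupby mean directly.
means = wide.groupby(level=0).mean()

# Pack each row of means back into a single "embedding" cell (as a list).
result = pd.DataFrame({"group": means.index.to_numpy(),
                       "embedding": means.to_numpy().tolist()})
print(result)
```

The idea is that the per-group work happens on a numeric block instead of calling a Python function once per group, which may help when there are many groups.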

edited Jul 2, 2021 at 1:48

answered Jul 2, 2021 at 1:43


1 Comment

Vinson Ciawandy Over a year ago

Your solution is currently the fastest one and the easiest to read :)

Nk03 · Accepted Answer · 2021-07-02 01:39:20Z

1

Alternative:

df = df.groupby("group")["embedding"].apply(lambda x: np.mean(
    np.hstack(x).reshape(-1, 2), axis = 0)).reset_index()

Complete Example:

import pandas as pd
import numpy as np

df = pd.DataFrame({"group": ["a", "a", "b", "b"], "embedding": [
                  [0, 1], [1, 0], [0, 0], [1, 1]]})

df = df.groupby("group")["embedding"].apply(lambda x: np.mean(
    np.hstack(x).reshape(-1, 2), axis = 0)).reset_index()
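
Note that reshape(-1, 2) above hard-codes an embedding length of 2. A possible variant that does not need to know the length is to stack with np.vstack instead. The sketch below uses a hypothetical helper name (mean_embedding) and made-up 3-dimensional sample data, and still assumes all embeddings in a group have equal length.

```python
import numpy as np
import pandas as pd

# Made-up sample data with 3-dimensional embeddings.
df3 = pd.DataFrame({"group": ["a", "a", "b"],
                    "embedding": [[0, 1, 2], [2, 1, 0], [3, 3, 3]]})

def mean_embedding(series):
    # Stack the group's embeddings into (n_rows, n_dims), then average rows.
    return np.vstack(series.tolist()).mean(axis=0)

result3 = df3.groupby("group")["embedding"].apply(mean_embedding).reset_index()
print(result3)
```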

answered Jul 2, 2021 at 1:39


BENY · Accepted Answer · 2021-07-02 01:58:00Z

0

Try apply with np.mean:

df.groupby('group')['embedding'].apply(np.mean).reset_index()
#  group   embedding
#0     a  [0.5, 0.5]
#1     b  [0.5, 0.5]

answered Jul 2, 2021 at 1:58


1 Comment

Vinson Ciawandy Over a year ago

This one's speed is on par with @Psidom's answer, but the code is simpler. Love it.

rhug123 · Accepted Answer · 2021-07-02 03:17:21Z

0

Here is another way:

df.groupby('group').agg({'embedding':lambda x: x.map(np.array).mean().tolist()})

answered Jul 2, 2021 at 3:17
