1

I have a DataFrame with two columns: one is the group and the other is a vector embedding. The data already comes in this form, so I'd rather not debate the layout of the embedding column. All embeddings have the same number of dimensions.

Basically, I want to calculate the average embedding for each group. By average I mean an elementwise (axis-level) average, so [1,2] and [4,8] average to [2.5,5].
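
To make the intended operation concrete, here is a minimal sketch of that elementwise average in plain NumPy, using the two vectors from the sentence above. The variable names are only illustrative.

```python
import numpy as np

# The two example vectors, stacked as rows of a 2-D array.
vectors = np.array([[1, 2], [4, 8]])

# axis=0 averages down the rows, i.e. position by position.
average = vectors.mean(axis=0)
print(average.tolist())  # [2.5, 5.0]
```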

import pandas as pd
import numpy as np

df = pd.DataFrame({"group":["a","a","b","b"],"embedding":[[0,1],[1,0],[0,0],[1,1]]})
df['embedding'] = df['embedding'].apply(np.array)

df.groupby("group").agg({"embedding":"mean"})  # This raises an error
/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in mean(self, numeric_only)
   1497             "mean",
   1498             alt=lambda x, axis: Series(x).mean(numeric_only=numeric_only),
-> 1499             numeric_only=numeric_only,
   1500         )
   1501 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
   1079 
   1080         if not output:
-> 1081             raise DataError("No numeric types to aggregate")
   1082 
   1083         return self._wrap_aggregated_output(output, index=self.grouper.result_index)

DataError: No numeric types to aggregate

Expected output:

pd.DataFrame({"group":["a","b"],"embedding":[[0.5,0.5],[0.5,0.5]]})

A fast solution would be much appreciated, since my data is quite large.

2
  • So the elements of that column are 2 element lists, and lists don't do math. [1,2]+[3,4] is not [4,6] Commented Jul 2, 2021 at 1:41
  • That's why I convert it to np.array using df['embedding'].apply(np.array). Commented Jul 2, 2021 at 1:42
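
The difference the two comments are pointing at can be shown with a short sketch: `+` on Python lists concatenates, while `+` on NumPy arrays adds elementwise. The variable names are only illustrative.

```python
import numpy as np

list_sum = [1, 2] + [3, 4]                       # lists concatenate
array_sum = np.array([1, 2]) + np.array([3, 4])  # arrays add elementwise

print(list_sum)            # [1, 2, 3, 4]
print(array_sum.tolist())  # [4, 6]
```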

4 Answers 4

2

If the elements of the embedding column are guaranteed to be NumPy arrays of the same shape, you can use groupby + apply with the Series.mean method to compute the elementwise average:

df.groupby('group').embedding.apply(lambda g: g.mean()).reset_index()
#  group   embedding
#0     a  [0.5, 0.5]
#1     b  [0.5, 0.5]

1 Comment

Your solution is currently the fastest one and easier to read :)
1

Alternative:

df = df.groupby("group")["embedding"].apply(lambda x: np.mean(
    np.hstack(x).reshape(-1, 2), axis = 0)).reset_index()

Complete Example:

import pandas as pd
import numpy as np

df = pd.DataFrame({"group": ["a", "a", "b", "b"], "embedding": [
                  [0, 1], [1, 0], [0, 0], [1, 1]]})

df = df.groupby("group")["embedding"].apply(lambda x: np.mean(
    np.hstack(x).reshape(-1, 2), axis = 0)).reset_index()
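
Note that `reshape(-1, 2)` above hard-codes the embedding width to 2. As a sketch of a variant that should not depend on the width, assuming only that every embedding in the column has the same length, `np.vstack` can infer the shape from the data. The names `means` and `result` are illustrative, and this version iterates over the groups explicitly rather than going through `apply`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "b"],
                   "embedding": [[0, 1], [1, 0], [0, 0], [1, 1]]})

# Iterate over the groups; np.vstack infers the embedding width from the data.
means = {name: np.vstack(emb.tolist()).mean(axis=0)
         for name, emb in df.groupby("group")["embedding"]}

# Rebuild a two-column frame, storing each mean as a plain list.
result = pd.DataFrame({"group": list(means),
                       "embedding": [m.tolist() for m in means.values()]})
print(result)
```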


0

Try apply with np.mean:

df.groupby('group')['embedding'].apply(np.mean).reset_index()
  group   embedding
0     a  [0.5, 0.5]
1     b  [0.5, 0.5]

1 Comment

This one's speed is on par with @Psidom's answer, but the code is simpler. Love it.
0

Here is another way:

df.groupby('group').agg({'embedding':lambda x: x.map(np.array).mean().tolist()})
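
Since the question mentions a large dataset, one more option that may be worth timing is to avoid per-group Python lambdas altogether. The sketch below is not taken from any of the answers above: it expands the embeddings into one numeric matrix so that groupby can use its built-in mean on ordinary numeric columns. It assumes all embeddings have the same length and that the expanded matrix fits in memory. The names `matrix`, `wide`, `group_means`, and `result` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "b"],
                   "embedding": [[0, 1], [1, 0], [0, 0], [1, 1]]})

# One (n_rows, n_dims) numeric matrix instead of n_rows Python objects.
matrix = np.vstack(df["embedding"].tolist())

# With numeric columns, groupby can apply its built-in mean per dimension.
wide = pd.DataFrame(matrix, index=df["group"])
group_means = wide.groupby(level=0).mean()

# Pack each row of means back into a single list-valued column.
result = pd.DataFrame({"group": group_means.index.tolist(),
                       "embedding": group_means.to_numpy().tolist()})
print(result)
```

Whether this is actually faster than the apply-based answers depends on the number of groups and the embedding width, so it is best benchmarked on the real data.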
