I have a DataFrame with two columns: one is the group and the other holds vector embeddings. The data already comes in this shape, so I'd rather not debate how the embedding column is stored. All embeddings have the same number of dimensions.
Basically, I want to calculate the average embedding for each group. By average I mean the element-wise (axis-level) average, so [1,2] and [4,8] average to [2.5,5].
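To make the intended average concrete, here is a minimal sketch using the two example vectors above. The variable names `vectors` and `average` are only for illustration.

```python
import numpy as np

# Two example embeddings of the same dimension, one per row.
vectors = np.array([[1, 2], [4, 8]])

# axis=0 averages across the rows, position by position.
average = vectors.mean(axis=0)
print(average.tolist())  # [2.5, 5.0]
```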
import pandas as pd
import numpy as np
df = pd.DataFrame({"group":["a","a","b","b"],"embedding":[[0,1],[1,0],[0,0],[1,1]]})
df['embedding'] = df['embedding'].apply(np.array)
df.groupby("group").agg({"embedding":"mean"})  # This raises an error
/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in mean(self, numeric_only)
1497 "mean",
1498 alt=lambda x, axis: Series(x).mean(numeric_only=numeric_only),
-> 1499 numeric_only=numeric_only,
1500 )
1501
/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
1079
1080 if not output:
-> 1081 raise DataError("No numeric types to aggregate")
1082
1083 return self._wrap_aggregated_output(output, index=self.grouper.result_index)
DataError: No numeric types to aggregate
Expected output:
pd.DataFrame({"group":["a","b"],"embedding":[[0.5,0.5],[0.5,0.5]]})
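A fast solution would be much appreciated, since my data is quite large.
Note that for plain Python lists [1,2]+[3,4] is not [4,6] (the lists are concatenated), which is why I convert the column with df['embedding'].apply(np.array).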
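One possible direction, sketched on the toy data from above and not benchmarked on large data: because every embedding has the same length, the column can be stacked into a 2-D numeric array so that the regular numeric `groupby(...).mean()` applies. The names `matrix`, `wide`, `means`, and `result` are hypothetical and only used for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "b"],
                   "embedding": [[0, 1], [1, 0], [0, 0], [1, 1]]})

# Stack the per-row embeddings into one (n_rows, n_dims) float array.
# Assumes every embedding has the same length.
matrix = np.array(df["embedding"].tolist(), dtype=float)

# One numeric column per dimension, indexed by the group label.
wide = pd.DataFrame(matrix, index=df["group"])

# Numeric columns can be averaged by the ordinary groupby mean.
means = wide.groupby(level=0).mean()

# Pack each row of means back into a single list-valued column.
result = pd.DataFrame({"group": means.index,
                       "embedding": means.to_numpy().tolist()})
print(result)
```

The idea is to keep the averaging inside vectorized numeric code instead of aggregating Python objects row by row; whether it is fast enough for the real data would need to be measured.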