I want to run a "groupby" on my pandas DataFrame and apply a different custom function to each column. For example, given this input:

annotator  event          interval_presence   duration
3          birds          [0,5]               5
3          birds          [7,9]               10
3          voices         [1,2]               10
3          traffic        [1,7]               7
5          voices         [4,7]               4
5          voices         [5,10]              6
5          traffic        [0,1]               4

Each item in "interval_presence" is a pandas Interval. When merging, I want to take the mean of the "duration" column and apply "pd.arrays.IntervalArray" and "piso.union" to the intervals in "interval_presence". The output would be:

annotator  event          interval_presence   duration
3          birds          [[0,5],[7,9]]       7.5
3          voices         [1,2]               10
3          traffic        [1,7]               7
5          voices         [4,10]              5
5          traffic        [0,1]               4

Right now, I know how to merge my intervals thanks to the answer in the post "Pandas: how to merge rows by union of intervals". That solution is:

data = data.groupby(['annotator', 'event'])['interval_presence'] \
    .apply(pd.arrays.IntervalArray) \
    .apply(piso.union) \
    .reset_index()

But how can I simultaneously apply a "mean" function to "duration"?

  • Use groupby.agg with a {'colname': function} dictionary. Commented Jan 28, 2023 at 16:47
  • I've seen that I can use .agg, but what's the syntax when using custom functions? Something like df = df.groupby(['annotator', 'event'])['interval_presence'].agg({ 'interval_presence':'.apply(pd.arrays.IntervalArray).apply(piso.union)', 'duration':'mean'}).reset_index() isn't valid syntax. Commented Jan 28, 2023 at 16:51

1 Answer

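You used the wrong agg syntax. The dictionary has to map each column name to a callable or a function name, not to a string of chained method calls. Also call agg on the whole groupby object instead of selecting ['interval_presence'] first, because that selection drops "duration". Try this: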

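import pandas as pd
import piso

df.groupby(["annotator", "event"]).agg({
    # Build an IntervalArray from the group's intervals, then take their union
    "interval_presence": lambda s: piso.union(pd.arrays.IntervalArray(s)),
    "duration": "mean"
})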

Within the lambda, s is a Series holding the pd.Interval objects of one (annotator, event) group.
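
To try the dictionary form of agg without installing piso, here is a minimal sketch on the sample data from the question. The helper merge_intervals is hypothetical: it is a plain-Python stand-in for piso.union that merges overlapping closed intervals and returns a list, not piso's actual implementation or return type. The closed="both" setting and sort=False are also assumptions made for the illustration.

```python
import pandas as pd


def merge_intervals(s):
    """Hypothetical stand-in for piso.union: merge overlapping closed intervals."""
    merged = []
    # Walk the intervals from left to right, extending the last merged one on overlap
    for iv in sorted(s, key=lambda iv: iv.left):
        if merged and iv.left <= merged[-1].right:
            last = merged[-1]
            merged[-1] = pd.Interval(last.left, max(last.right, iv.right), closed="both")
        else:
            merged.append(iv)
    return merged


bounds = [(0, 5), (7, 9), (1, 2), (1, 7), (4, 7), (5, 10), (0, 1)]
df = pd.DataFrame({
    "annotator": [3, 3, 3, 3, 5, 5, 5],
    "event": ["birds", "birds", "voices", "traffic", "voices", "voices", "traffic"],
    "interval_presence": [pd.Interval(a, b, closed="both") for a, b in bounds],
    "duration": [5, 10, 10, 7, 4, 6, 4],
})

# One entry per column: a callable for the intervals, a function name for the mean.
# sort=False keeps the groups in order of first appearance.
result = (
    df.groupby(["annotator", "event"], sort=False)
      .agg({"interval_presence": merge_intervals, "duration": "mean"})
      .reset_index()
)
print(result)
```

Any callable that takes a group's Series and returns a single object can be used as a dictionary value, so the lambda from the answer can be swapped in for merge_intervals unchanged.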
