I have a data frame like this:

    Name1  Name2   Start End
    aaa    bbb     1     2
    aaa    bbb     2     22
    aaa    bbb     30    42
    ccc    ddd     100   141
    ccc    ddd     145   160
    ccc    ddd     160   178

How do I merge rows where the end time of one row equals the start time of the next row, and otherwise keep the row as is? The expected result looks like this:

    Name1  Name2   Start End
    aaa    bbb     1     22
    aaa    bbb     30    42
    ccc    ddd     100   141
    ccc    ddd     145   178

I can do this using iterrows, but I am wondering if there is a better way, such as apply or groupby.

1 Answer

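To rephrase the problem: you need to find where one run of touching or overlapping intervals ends and the next one begins. If you sort by Start in ascending order, then within each (Name1, Name2) group a new interval begins whenever the cumulative maximum of End over the preceding rows is smaller than the current row's Start. Based on this observation, you can build a group variable (the cumulative sum of those "new interval" flags) and take the minimum Start and maximum End within each group to get the merged intervals: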

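    # Assumes pandas is installed and df is the data frame from the question.
    # Sort by Start so the running maximum of End is meaningful within each group.
    df.sort_values('Start', inplace=True)
    df.groupby(['Name1', 'Name2']).apply(
        # Start a new sub-group whenever Start exceeds the largest End seen so far.
        lambda g: g.groupby((g.End.cummax().shift() < g.Start).cumsum())
                   .agg({'Start': 'min', 'End': 'max'})
    ).reset_index(level=[0, 1])

      Name1 Name2  Start  End
    0   aaa   bbb      1   22
    1   aaa   bbb     30   42
    0   ccc   ddd    100  141
    1   ccc   ddd    145  178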
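
If the nested groupby is hard to follow, the same idea can be unrolled into named steps. The following is only a sketch of one way to do that: the helper columns `run_max`, `prev_max`, `new_block` and `block` are names made up here for illustration, and the data frame is rebuilt from the question's sample. Like the one-liner, it merges intervals that overlap as well as those that merely touch.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name1": ["aaa", "aaa", "aaa", "ccc", "ccc", "ccc"],
    "Name2": ["bbb", "bbb", "bbb", "ddd", "ddd", "ddd"],
    "Start": [1, 2, 30, 100, 145, 160],
    "End":   [2, 22, 42, 141, 160, 178],
})

keys = ["Name1", "Name2"]
steps = df.sort_values(keys + ["Start"]).reset_index(drop=True)

# Step 1: running maximum of End within each (Name1, Name2) group.
steps["run_max"] = steps.groupby(keys)["End"].cummax()

# Step 2: shift it down one row within each group, so every row sees the
# furthest End reached by the rows before it (NaN on a group's first row).
steps["prev_max"] = steps.groupby(keys)["run_max"].shift()

# Step 3: a row opens a new interval when its Start lies beyond prev_max.
# Comparisons against NaN are False, so a group's first row is flagged 0.
steps["new_block"] = (steps["Start"] > steps["prev_max"]).astype(int)

# Step 4: a cumulative sum of the flags within each group gives block ids.
steps["block"] = steps.groupby(keys)["new_block"].cumsum()

# Step 5: one flat groupby collapses each block to its min Start / max End.
merged = (
    steps.groupby(keys + ["block"], as_index=False)
         .agg(Start=("Start", "min"), End=("End", "max"))
         .drop(columns="block")
)
print(merged)
```

Printing `steps` before the final aggregation shows each intermediate column side by side, which may make it easier to see why the rows (145, 160) and (160, 178) end up in the same block.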

1 Comment

A groupby within another groupby always confuses me. Could you elaborate on how this works? I want to learn the thinking process behind this approach so I can use it in the future. Thanks!
