
I have a DataFrame that can be produced using this Python code:

import pandas as pd

df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
                   'time': list(range(6)) * 2,
                   'observations': list(range(12))})
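
For reference, df looks like this:

    visit  time  observations
0       1     0             0
1       1     1             1
2       1     2             2
3       1     3             3
4       1     4             4
5       1     5             5
6       2     0             6
7       2     1             7
8       2     2             8
9       2     3             9
10      2     4            10
11      2     5            11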

The following code enables me to reformat the data as desired:

dflist = []
for v_ in df.visit.unique():
    for t_ in df.time[df.visit == v_]:
        # one row per (visit, time): all observations up to and including t_
        dflist.append([df[(df.visit == v_) & (df.time <= t_)]
                       .groupby('visit')['observations'].apply(list)])

pd.DataFrame(pd.concat([d[0] for d in dflist], axis=0))

However, this is extremely slow.

I have tried using .expanding(); however, it only returns scalars, whereas I would like lists (or NumPy arrays).
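
For example, something like the following fails, because expanding().apply() requires the function to return a single numeric value:

# raises TypeError: the applied function must return a scalar, not a list
df.groupby('visit')['observations'].expanding().apply(lambda x: list(x))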

I would appreciate any help in vectorizing or otherwise optimizing this procedure.

Thanks

2 Answers

Fortunately, in pandas 1.1.0 and newer, expanding() produces an iterable, which lets you take advantage of the faster grouping while still producing non-scalar data like lists:

new_df = pd.DataFrame({
    'observations':
        [list(x) for x in df.groupby('visit')['observations'].expanding()]
}, index=df['visit'])

new_df:

               observations
visit                      
1                       [0]
1                    [0, 1]
1                 [0, 1, 2]
1              [0, 1, 2, 3]
1           [0, 1, 2, 3, 4]
1        [0, 1, 2, 3, 4, 5]
2                       [6]
2                    [6, 7]
2                 [6, 7, 8]
2              [6, 7, 8, 9]
2          [6, 7, 8, 9, 10]
2      [6, 7, 8, 9, 10, 11]
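
Each x yielded by iterating expanding() is a Series covering the group's rows up to that point, so you can just as easily produce NumPy arrays instead of lists for the question's "or numpy array" case (a minor variant of the same approach):

arr_df = pd.DataFrame({
    'observations':
        [x.to_numpy() for x in df.groupby('visit')['observations'].expanding()]
}, index=df['visit'])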

Timing via %timeit:

Setup:

import pandas as pd
df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
                   'time': list(range(6)) * 2,
                   'observations': list(range(12))})

Original:

def fn():
    dflist = []
    for v_ in df.visit.unique():
        for t_ in df.time[df.visit == v_]:
            dflist.append([
                df[(df.visit == v_) & (df.time <= t_)]
                    .groupby('visit')['observations'].apply(list)
            ])

    return pd.DataFrame(pd.concat([d[0] for d in dflist], axis=0))

%timeit fn()
13 ms ± 692 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

List comprehension with expanding (~13x faster on this sample):

def fn2():
    return pd.DataFrame({
        'observations':
            [list(x) for x in df.groupby('visit')['observations'].expanding()]
    }, index=df['visit'])

%timeit fn2()
967 µs ± 57.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Sanity Check:

fn().eq(fn2()).all(axis=None)  # True

The double-apply approach by @Quixotic22 (~3.4x faster than the original, but ~3.9x slower than comprehension + expanding):

def fn3():
    return (df.
            set_index('visit')['observations'].
            apply(lambda x: [x]).
            reset_index().groupby('visit')['observations'].
            apply(lambda x: x.cumsum()))

%timeit fn3()
3.78 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

*Note: this approach only produces a Series of observations; it does not include the visit as the index.

fn().eq(fn3()).all(axis=None)  # False
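
If fn3's values are wanted in the same shape as fn2's output, one option (a sketch, assuming fn3's row order matches df, which holds for this sample) is to rebuild the frame with visit as the index:

def fn3_aligned():
    return pd.DataFrame(
        {'observations': fn3().to_numpy()},  # discard fn3's own index
        index=df['visit'])

fn3_aligned().eq(fn2()).all(axis=None)  # True on this sample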

1 Comment

TIL: that expanding() can be iterated... +1

Looks like a good solution has already been provided, but dropping this here as a viable alternative.

(df.
 set_index('visit')['observations'].
 apply(lambda x: [x]).
 reset_index().groupby('visit')['observations'].
 apply(lambda x: x.cumsum())
 )
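
This works because adding two lists with + concatenates them, so the grouped cumsum over a column of single-element lists accumulates the expanding lists. A minimal illustration:

s = pd.Series([[0], [1], [2]])
s.cumsum()
# 0          [0]
# 1       [0, 1]
# 2    [0, 1, 2]
# dtype: object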
