1

I'm creating an additional column "Total_Count" to store the cumulative count record by Site and Count_Record column information. My coding is almost done for total cumulative count. However, the Total_Count column is shift for a specific Card as below. Could someone help with code modification, thank you!

Expected Output:

enter image description here

Current Output:

enter image description here

My Code:

import pandas as pd
df1 = pd.DataFrame(columns=['site', 'card', 'date', 'count_record'],
      data=[['A', 'C1', '12-Oct', 5], 
            ['A', 'C1', '13-Oct', 10], 
            ['A', 'C1', '14-Oct', 18],
            ['A', 'C1', '15-Oct', 21], 
            ['A', 'C1', '16-Oct', 29],
            ['B', 'C2', '12-Oct', 11],
            ['A', 'C2', '13-Oct', 2],
            ['A', 'C2', '14-Oct', 7],
            ['A', 'C2', '15-Oct', 13],
            ['B', 'C2', '16-Oct', 4]])

df_append_temp=[]

total = 0
preCard = ''
preSite = ''
preCount = 0

for pc in df1['card'].unique():
    df2 = df1[df1['card'] == pc].sort_values(['date'])

    total = 0

    for i in range(0, len(df2)):
        site = df2.iloc[i]['site']
        count = df2.iloc[i]['count_record']
    
        if site == preSite:
            total += (count - preCount)
        else:
            total += count
    
        preCount = count
        preSite = site

        df2.loc[i, 'Total_Count'] = total #something wrong using loc here
    
    df_append_temp.append(df2)

df3 = pd.DataFrame(pd.concat(df_append_temp), columns=df2.columns)
df3
2
  • i is an integer location. loc is label based location. You should use iloc as above. df2.loc[i, df2.columns.get_loc('Total_Count')] = total. You'd need to initialize the Total_Count column first though with that approach. Commented Oct 13, 2021 at 17:13
  • label based location doesn't work here since you subset df1[df1['card'] == pc] you're not guaranteed to get a contiguous default range index 0-length. Commented Oct 13, 2021 at 17:14

1 Answer 1

1

To modify the current implementation we can use groupby to create our df2 which allows us to apply a function to each grouped DataFrame to create the new column. This should offer similar performance as the current implementation but produce correctly aligned Series:

def calc_total_count(df2: pd.DataFrame) -> pd.Series:
    total = 0
    pre_count = 0
    pre_site = ''
    lst = []
    for c, s in zip(df2['count_record'], df2['site']):
        if s == pre_site:
            total += (c - pre_count)
        else:
            total += c

        pre_count = c
        pre_site = s
        lst.append(total)
    return pd.Series(lst, index=df2.index, name='Total_Count')


df3 = pd.concat([
    df1,
    df1.sort_values('date').groupby('card').apply(calc_total_count).droplevel(0)
], axis=1)

Alternatively we can use groupby, then within groups Series.shift to get the previous site, and count_record. Then use np.where to conditionally determine each row's value and ndarray.cumsum to calculate the cumulative total of the resulting values:

def calc_total_count(df2: pd.DataFrame) -> pd.Series:
    return pd.Series(
        np.where(df2['site'] == df2['site'].shift(),
                 df2['count_record'] - df2['count_record'].shift(fill_value=0),
                 df2['count_record']).cumsum(),
        index=df2.index,
        name='Total_Count'
    )


df3 = pd.concat([
    df1,
    df1.sort_values('date').groupby('card').apply(calc_total_count).droplevel(0)
], axis=1)

Either approach produces df3:

  site card    date  count_record  Total_Count
0    A   C1  12-Oct             5            5
1    A   C1  13-Oct            10           10
2    A   C1  14-Oct            18           18
3    A   C1  15-Oct            21           21
4    A   C1  16-Oct            29           29
5    B   C2  12-Oct            11           11
6    A   C2  13-Oct             2           13
7    A   C2  14-Oct             7           18
8    A   C2  15-Oct            13           24
9    B   C2  16-Oct             4           28

Setup and imports:

import numpy as np  # only needed if using np.where
import pandas as pd

df1 = pd.DataFrame(columns=['site', 'card', 'date', 'count_record'],
                   data=[['A', 'C1', '12-Oct', 5],
                         ['A', 'C1', '13-Oct', 10],
                         ['A', 'C1', '14-Oct', 18],
                         ['A', 'C1', '15-Oct', 21],
                         ['A', 'C1', '16-Oct', 29],
                         ['B', 'C2', '12-Oct', 11],
                         ['A', 'C2', '13-Oct', 2],
                         ['A', 'C2', '14-Oct', 7],
                         ['A', 'C2', '15-Oct', 13],
                         ['B', 'C2', '16-Oct', 4]])
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you so much @Henry Ecker ! Your code is working perfectly, as well as solved my trouble of cumulative summation. You saved me a lot of time.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.