Python dataframe unexpected display error using loc

Question

I'm creating an additional column "Total_Count" to store the cumulative count record by Site and Count_Record column information. My coding is almost done for total cumulative count. However, the Total_Count column is shift for a specific Card as below. Could someone help with code modification, thank you!

Expected Output:

Current Output:

My Code:

import pandas as pd
df1 = pd.DataFrame(columns=['site', 'card', 'date', 'count_record'],
      data=[['A', 'C1', '12-Oct', 5], 
            ['A', 'C1', '13-Oct', 10], 
            ['A', 'C1', '14-Oct', 18],
            ['A', 'C1', '15-Oct', 21], 
            ['A', 'C1', '16-Oct', 29],
            ['B', 'C2', '12-Oct', 11],
            ['A', 'C2', '13-Oct', 2],
            ['A', 'C2', '14-Oct', 7],
            ['A', 'C2', '15-Oct', 13],
            ['B', 'C2', '16-Oct', 4]])

df_append_temp=[]

total = 0
preCard = ''
preSite = ''
preCount = 0

for pc in df1['card'].unique():
    df2 = df1[df1['card'] == pc].sort_values(['date'])

    total = 0

    for i in range(0, len(df2)):
        site = df2.iloc[i]['site']
        count = df2.iloc[i]['count_record']
    
        if site == preSite:
            total += (count - preCount)
        else:
            total += count
    
        preCount = count
        preSite = site

        df2.loc[i, 'Total_Count'] = total #something wrong using loc here
    
    df_append_temp.append(df2)

df3 = pd.DataFrame(pd.concat(df_append_temp), columns=df2.columns)
df3

i is an integer location. loc is label based location. You should use iloc as above. df2.loc[i, df2.columns.get_loc('Total_Count')] = total. You'd need to initialize the Total_Count column first though with that approach. — Henry Ecker
– Henry Ecker ♦, Commented Oct 13, 2021 at 17:13
label based location doesn't work here since you subset df1[df1['card'] == pc] you're not guaranteed to get a contiguous default range index 0-length. — Henry Ecker
– Henry Ecker ♦, Commented Oct 13, 2021 at 17:14

Henry Ecker · Accepted Answer · 2021-10-14 12:50:03Z

To modify the current implementation we can use groupby to create our df2 which allows us to apply a function to each grouped DataFrame to create the new column. This should offer similar performance as the current implementation but produce correctly aligned Series:

def calc_total_count(df2: pd.DataFrame) -> pd.Series:
    total = 0
    pre_count = 0
    pre_site = ''
    lst = []
    for c, s in zip(df2['count_record'], df2['site']):
        if s == pre_site:
            total += (c - pre_count)
        else:
            total += c

        pre_count = c
        pre_site = s
        lst.append(total)
    return pd.Series(lst, index=df2.index, name='Total_Count')


df3 = pd.concat([
    df1,
    df1.sort_values('date').groupby('card').apply(calc_total_count).droplevel(0)
], axis=1)

Alternatively we can use groupby, then within groups Series.shift to get the previous site, and count_record. Then use np.where to conditionally determine each row's value and ndarray.cumsum to calculate the cumulative total of the resulting values:

def calc_total_count(df2: pd.DataFrame) -> pd.Series:
    return pd.Series(
        np.where(df2['site'] == df2['site'].shift(),
                 df2['count_record'] - df2['count_record'].shift(fill_value=0),
                 df2['count_record']).cumsum(),
        index=df2.index,
        name='Total_Count'
    )


df3 = pd.concat([
    df1,
    df1.sort_values('date').groupby('card').apply(calc_total_count).droplevel(0)
], axis=1)

Either approach produces df3:

  site card    date  count_record  Total_Count
0    A   C1  12-Oct             5            5
1    A   C1  13-Oct            10           10
2    A   C1  14-Oct            18           18
3    A   C1  15-Oct            21           21
4    A   C1  16-Oct            29           29
5    B   C2  12-Oct            11           11
6    A   C2  13-Oct             2           13
7    A   C2  14-Oct             7           18
8    A   C2  15-Oct            13           24
9    B   C2  16-Oct             4           28

Setup and imports:

import numpy as np  # only needed if using np.where
import pandas as pd

df1 = pd.DataFrame(columns=['site', 'card', 'date', 'count_record'],
                   data=[['A', 'C1', '12-Oct', 5],
                         ['A', 'C1', '13-Oct', 10],
                         ['A', 'C1', '14-Oct', 18],
                         ['A', 'C1', '15-Oct', 21],
                         ['A', 'C1', '16-Oct', 29],
                         ['B', 'C2', '12-Oct', 11],
                         ['A', 'C2', '13-Oct', 2],
                         ['A', 'C2', '14-Oct', 7],
                         ['A', 'C2', '15-Oct', 13],
                         ['B', 'C2', '16-Oct', 4]])

Thank you so much @Henry Ecker ! Your code is working perfectly, as well as solved my trouble of cumulative summation. You saved me a lot of time.

Collectives™ on Stack Overflow

Python dataframe unexpected display error using loc

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related