I have the following data in my dataframe:

uniquecode1 year    month   Name  Sale
    1029    2020      5     ABC    10
    1029    2020      6     ABC    20
    1029    2020      10    ABC    30 
    1029    2020      11    ABC    35
    1029    2020      12    ABC    38
    1050    2020      4     DEF    39
    1050    2020      5     DEF    40
    1050    2020      6     DEF    31
    1050    2020      7     DEF    45
    1050    2020      8     DEF    55
    1079    2020      4     GHI    65
    1079    2021      2     GHI    75
    10810   2021      1     XYZ    85
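
To reproduce the sample, the frame can be built roughly like this (a minimal sketch; it assumes the numeric columns are plain integers and Name is a string):

```python
import pandas as pd

# Sample data from the table above, one list per column.
df = pd.DataFrame({
    "uniquecode1": [1029] * 5 + [1050] * 5 + [1079] * 2 + [10810],
    "year": [2020] * 10 + [2020, 2021] + [2021],
    "month": [5, 6, 10, 11, 12, 4, 5, 6, 7, 8, 4, 2, 1],
    "Name": ["ABC"] * 5 + ["DEF"] * 5 + ["GHI"] * 2 + ["XYZ"],
    "Sale": [10, 20, 30, 35, 38, 39, 40, 31, 45, 55, 65, 75, 85],
})
```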

Say the current month is Mar'21. The upper bound for the months in 2021 is then Mar'21 minus one month, i.e. Feb'21.
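
One way this cutoff could be computed is sketched below. The `today` value is a hypothetical stand-in for the current date, hard-coded here for illustration:

```python
import pandas as pd

today = pd.Timestamp("2021-03-15")  # assumed "current" date, for illustration only

# Step back one calendar month, then take the first day of that month.
last_month = (today.to_period("M") - 1).to_timestamp()
print(last_month)  # 2021-02-01 00:00:00
```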

We see that data is divided into groups of different uniquecode1. For every group of uniquecode1, we have values missing in the column 'month'.

  1. For 1029, we have missing month values 7,8,9 for 2020 and 1,2 for 2021
  2. For 1050, we have missing month values 9,10,11,12 for 2020 and 1,2 for 2021
  3. For 1079, we have missing month values 5,6,7,8,9,10,11,12 for 2020 and 1 for 2021
  4. For 10810, we have missing month values 4,5,6,7,8,9,10,11,12 for 2020 and 2 for 2021

I am new to pandas. I am trying to build logic that fills in the missing values above. When the rows for the missing month and year values are inserted, 'uniquecode1' and 'Name' should be copied from their respective group, and 'Sale' should be 0 or NaN.

Can somebody help me write the code for this in pandas? Let me know if you need any other details.

2 Answers

You can convert year and month to datetimes and then add all missing combinations: DataFrame.set_index followed by Series.unstack with fill_value=0 creates a 0 for every combination that does not exist, and DataFrame.stack with Series.reset_index restores the original long format:

df['dates'] = pd.to_datetime(df[['year','month']].assign(day=1))

df = (df.set_index(['uniquecode1','Name', 'dates'])['Sale']
        .unstack(fill_value=0)
        .stack()
        .reset_index(name='Sale'))

print (df.head(10))
    uniquecode1 Name      dates  Sale
0          1029  ABC 2020-04-01     0
1          1029  ABC 2020-05-01    10
2          1029  ABC 2020-06-01    20
3          1029  ABC 2020-07-01     0
4          1029  ABC 2020-08-01     0
5          1029  ABC 2020-10-01    30
6          1029  ABC 2020-11-01    35
7          1029  ABC 2020-12-01    38
8          1029  ABC 2021-01-01     0
9          1029  ABC 2021-02-01     0
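
The mechanism can be seen on a toy frame. This is only an illustrative sketch: the `code` column and its values are made up and are not part of the question's data.

```python
import pandas as pd

# Toy data: group "a" has January and March, group "b" has only January.
toy = pd.DataFrame({
    "code": ["a", "a", "b"],
    "dates": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-01-01"]),
    "Sale": [1, 2, 3],
})

# unstack makes one column per date seen anywhere in the data;
# fill_value=0 fills the (group, date) pairs that had no row.
wide = toy.set_index(["code", "dates"])["Sale"].unstack(fill_value=0)

# stack + reset_index goes back to one row per (group, date).
long = wide.stack().reset_index(name="Sale")
print(long)
```

Group "b" gains a March row with Sale 0 because group "a" has that date. The columns of `wide` contain only dates that occur in at least one group.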

Finally, add the year and month columns:

df = df.assign(year = df['dates'].dt.year, month = df['dates'].dt.month)
print (df.head())
   uniquecode1 Name      dates  Sale  year  month
0         1029  ABC 2020-04-01     0  2020      4
1         1029  ABC 2020-05-01    10  2020      5
2         1029  ABC 2020-06-01    20  2020      6
3         1029  ABC 2020-07-01     0  2020      7
4         1029  ABC 2020-08-01     0  2020      8

Unfortunately 09-2020 is still missing, because no group has that month, so DataFrame.reindex is needed to add it:

df['dates'] = pd.to_datetime(df[['year','month']].assign(day=1))
mux = pd.date_range(df['dates'].min(), df['dates'].max(), freq='MS', name='dates')

# to set the maximum manually instead:
#mux = pd.date_range(df['dates'].min(), '2021-03-01', freq='MS', name='dates')

df = (df.set_index(['uniquecode1','Name', 'dates'])['Sale']
        .unstack(fill_value=0)
        .reindex(mux, axis=1, fill_value=0)
        .stack()
        .reset_index(name='Sale')
        )

df = df.assign(year = df['dates'].dt.year, month = df['dates'].dt.month)
print (df.head(10))
   uniquecode1 Name      dates  Sale  year  month
0         1029  ABC 2020-04-01     0  2020      4
1         1029  ABC 2020-05-01    10  2020      5
2         1029  ABC 2020-06-01    20  2020      6
3         1029  ABC 2020-07-01     0  2020      7
4         1029  ABC 2020-08-01     0  2020      8
5         1029  ABC 2020-09-01     0  2020      9
6         1029  ABC 2020-10-01    30  2020     10
7         1029  ABC 2020-11-01    35  2020     11
8         1029  ABC 2020-12-01    38  2020     12
9         1029  ABC 2021-01-01     0  2021      1
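
Note that the approach above fills every group from the overall earliest month. If each group should instead start at its own first month, as the list in the question suggests for 1029, a per-group reindex is one possible variant. The following is a minimal sketch on a reduced sample, with the `end` cutoff assumed to be Feb'21:

```python
import pandas as pd

# Reduced sample (a subset of the question's rows) to keep the sketch short.
df = pd.DataFrame({
    "uniquecode1": [1029, 1029, 1079, 1079, 10810],
    "year": [2020, 2020, 2020, 2021, 2021],
    "month": [5, 10, 4, 2, 1],
    "Name": ["ABC", "ABC", "GHI", "GHI", "XYZ"],
    "Sale": [10, 30, 65, 75, 85],
})
df["dates"] = pd.to_datetime(df[["year", "month"]].assign(day=1))
end = pd.Timestamp("2021-02-01")  # assumed cutoff: Mar'21 minus one month

parts = []
for (code, name), g in df.groupby(["uniquecode1", "Name"]):
    # Month starts from this group's first date up to the shared cutoff.
    idx = pd.date_range(g["dates"].min(), end, freq="MS", name="dates")
    filled = g.set_index("dates")["Sale"].reindex(idx, fill_value=0).reset_index()
    parts.append(filled.assign(uniquecode1=code, Name=name))

out = pd.concat(parts, ignore_index=True)
out = out.assign(year=out["dates"].dt.year, month=out["dates"].dt.month)
out = out[["uniquecode1", "year", "month", "Name", "Sale"]]
print(out)
```

Omitting `fill_value=0` in the reindex call would leave NaN in the inserted rows instead of 0.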

One option is the complete function from pyjanitor, which makes implicitly missing rows explicit and abstracts away the reshaping process:

# pip install pyjanitor
import pandas as pd
import janitor

# create date column, combining year and month
df['dates'] = pd.to_datetime(df[['year','month']].assign(day=1))

# build a dictionary for each group
# where the start date is the first date in the group
# and the last date is `2021-02-01`
dates = {'dates': lambda df: pd.date_range(df.min(), '2021-02-01', freq='MS')}

# apply the function, with uniquecode1 and name as the groupby names
# and do some cleanup to get the final output
(df.complete(dates, by=['uniquecode1', 'Name'], sort =True)
   .fillna({'Sale': 0})
   .astype({'Sale': int})  # replaces fillna's downcast= keyword, deprecated in recent pandas
   .assign(year = lambda df: df.dates.dt.year,
           month = lambda df: df.dates.dt.month)
   .drop(columns='dates')
)

    uniquecode1  year  month Name  Sale
0          1029  2020      5  ABC    10
1          1029  2020      6  ABC    20
2          1029  2020      7  ABC     0
3          1029  2020      8  ABC     0
4          1029  2020      9  ABC     0
5          1029  2020     10  ABC    30
6          1029  2020     11  ABC    35
7          1029  2020     12  ABC    38
8          1029  2021      1  ABC     0
9          1029  2021      2  ABC     0
10         1050  2020      4  DEF    39
11         1050  2020      5  DEF    40
12         1050  2020      6  DEF    31
13         1050  2020      7  DEF    45
14         1050  2020      8  DEF    55
15         1050  2020      9  DEF     0
16         1050  2020     10  DEF     0
17         1050  2020     11  DEF     0
18         1050  2020     12  DEF     0
19         1050  2021      1  DEF     0
20         1050  2021      2  DEF     0
21         1079  2020      4  GHI    65
22         1079  2020      5  GHI     0
23         1079  2020      6  GHI     0
24         1079  2020      7  GHI     0
25         1079  2020      8  GHI     0
26         1079  2020      9  GHI     0
27         1079  2020     10  GHI     0
28         1079  2020     11  GHI     0
29         1079  2020     12  GHI     0
30         1079  2021      1  GHI     0
31         1079  2021      2  GHI    75
32        10810  2021      1  XYZ    85
33        10810  2021      2  XYZ     0

There is no 2020 data for 10810 in your sample dataframe, and as such there is none in the final output above.
