Find missing number from a sequence in Pandas dataframes

Question

I have the following data in my dataframe:

uniquecode1 year    month   Name  Sale
    1029    2020      5     ABC    10
    1029    2020      6     ABC    20
    1029    2020      10    ABC    30 
    1029    2020      11    ABC    35
    1029    2020      12    ABC    38
    1050    2020      4     DEF    39
    1050    2020      5     DEF    40
    1050    2020      6     DEF    31
    1050    2020      7     DEF    45
    1050    2020      8     DEF    55
    1079    2020      4     GHI    65
    1079    2021      2     GHI    75
    10810   2021      1     XYZ    85

Let us say we are sitting in Mar'21. For the upper range of month in 2021, we will limit ourselves to Mar'21 minus 1 i.e. Feb'21

We see that data is divided into groups of different uniquecode1. For every group of uniquecode1, we have values missing in the column 'month'.

For 1029, we have missing month values 7,8,9 for 2020 and 1,2 for 2021
For 1050, we have missing month values 9,10,11,12 for 2020 and 1,2 for 2021
For 1079, we have missing month values 5,6,7,8,9,10,11,12 for 2020 and 1 for 2021
For 10810, we have missing month values 4,5,6,7,8,9,10,11,12 for 2020 and 2 for 2021

I am new to pandas. I am trying to build a logic which takes care of the above missing values. When the missing month and year values are inserted into the data, 'uniquecode1' and 'name' should be copied from their respective group values and 'Sale' should have value 0 or NaN.

Can somebody help me write a code for it in pandas? Let me know what other details you might require.

jezrael · Accepted Answer · 2021-03-26 07:57:01Z

You can convert year and month to datetimes, then add every missing combination by reshaping: use DataFrame.set_index with Series.unstack (filling non-existent values with 0), followed by DataFrame.stack and Series.reset_index to get back to the original long format:

df['dates'] = pd.to_datetime(df[['year','month']].assign(day=1))

df = (df.set_index(['uniquecode1','Name', 'dates'])['Sale']
        .unstack(fill_value=0)
        .stack()
        .reset_index(name='Sale'))

print (df.head(10))
    uniquecode1 Name      dates  Sale
0          1029  ABC 2020-04-01     0
1          1029  ABC 2020-05-01    10
2          1029  ABC 2020-06-01    20
3          1029  ABC 2020-07-01     0
4          1029  ABC 2020-08-01     0
5          1029  ABC 2020-10-01    30
6          1029  ABC 2020-11-01    35
7          1029  ABC 2020-12-01    38
8          1029  ABC 2021-01-01     0
9          1029  ABC 2021-02-01     0

Finally, add the year and month columns back:

df = df.assign(year = df['dates'].dt.year, month = df['dates'].dt.month)
print (df.head())
   uniquecode1 Name      dates  Sale  year  month
0         1029  ABC 2020-04-01     0  2020      4
1         1029  ABC 2020-05-01    10  2020      5
2         1029  ABC 2020-06-01    20  2020      6
3         1029  ABC 2020-07-01     0  2020      7
4         1029  ABC 2020-08-01     0  2020      8
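
Since unstack can only create columns for dates that occur in at least one group, it can help to check which months appear nowhere in the data. The following is a minimal sketch, assuming the year and month values from the question retyped by hand; the names `present`, `expected` and `missing` are only illustrative:

```python
import pandas as pd

# year/month values retyped from the question's sample data
years = [2020] * 11 + [2021, 2021]
months = [5, 6, 10, 11, 12, 4, 5, 6, 7, 8, 4, 2, 1]

# first day of each month that occurs anywhere in the data
present = pd.to_datetime(pd.DataFrame({"year": years, "month": months, "day": 1}))

# every month start between the earliest and latest observed month
expected = pd.date_range(present.min(), present.max(), freq="MS")

# months that no group contains, so unstack cannot create a column for them
missing = expected.difference(pd.DatetimeIndex(present))
print(missing)
```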

Unfortunately, 09-2020 is still missing because no group contains that month, so it is necessary to add DataFrame.reindex:

df['dates'] = pd.to_datetime(df[['year','month']].assign(day=1))
mux = pd.date_range(df['dates'].min(), df['dates'].max(), freq='MS', name='dates')

# to set the maximum manually
#mux = pd.date_range(df['dates'].min(), '2021-03-01', freq='MS', name='dates')

df = (df.set_index(['uniquecode1','Name', 'dates'])['Sale']
        .unstack(fill_value=0)
        .reindex(mux, axis=1, fill_value=0)
        .stack()
        .reset_index(name='Sale')
        )

df = df.assign(year = df['dates'].dt.year, month = df['dates'].dt.month)
print (df.head(10))
   uniquecode1 Name      dates  Sale  year  month
0         1029  ABC 2020-04-01     0  2020      4
1         1029  ABC 2020-05-01    10  2020      5
2         1029  ABC 2020-06-01    20  2020      6
3         1029  ABC 2020-07-01     0  2020      7
4         1029  ABC 2020-08-01     0  2020      8
5         1029  ABC 2020-09-01     0  2020      9
6         1029  ABC 2020-10-01    30  2020     10
7         1029  ABC 2020-11-01    35  2020     11
8         1029  ABC 2020-12-01    38  2020     12
9         1029  ABC 2021-01-01     0  2021      1
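
The reshaping above gives every group the same date range, starting at the overall minimum. If each group should instead start at its own first month and end at a fixed cutoff, one possible variation is to reindex group by group. This is only a sketch, not part of the original answer: the sample frame is retyped from the question, the cutoff `end` is assumed to be Feb'21, and the loop is one of several ways to do it.

```python
import pandas as pd

# sample data retyped from the question
df = pd.DataFrame({
    "uniquecode1": [1029] * 5 + [1050] * 5 + [1079] * 2 + [10810],
    "year": [2020] * 11 + [2021, 2021],
    "month": [5, 6, 10, 11, 12, 4, 5, 6, 7, 8, 4, 2, 1],
    "Name": ["ABC"] * 5 + ["DEF"] * 5 + ["GHI"] * 2 + ["XYZ"],
    "Sale": [10, 20, 30, 35, 38, 39, 40, 31, 45, 55, 65, 75, 85],
})
df["dates"] = pd.to_datetime(df[["year", "month"]].assign(day=1))

end = pd.Timestamp("2021-02-01")  # assumed cutoff (Feb'21)

pieces = []
for (code, name), g in df.groupby(["uniquecode1", "Name"]):
    # monthly range from this group's first month up to the cutoff
    full = pd.date_range(g["dates"].min(), end, freq="MS")
    filled = (g.set_index("dates")["Sale"]
                .reindex(full, fill_value=0)   # missing months get Sale 0
                .rename_axis("dates")
                .reset_index())
    filled.insert(0, "uniquecode1", code)
    filled.insert(1, "Name", name)
    pieces.append(filled)

out = pd.concat(pieces, ignore_index=True)
out = out.assign(year=out["dates"].dt.year, month=out["dates"].dt.month)
print(out.head(10))
```

Passing `fill_value=float("nan")` instead of 0 would leave the inserted rows as NaN, which the question also allows.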

sammywemmy · Accepted Answer · 2021-12-19 03:41:26Z

One option is the complete function from pyjanitor, which makes implicitly missing rows explicit and abstracts away the reshaping process:

# pip install pyjanitor
import pandas as pd
import janitor

# create date column, combining year and month
df['dates'] = pd.to_datetime(df[['year','month']].assign(day=1))

# build a dictionary for each group
# where the start date is the first date in the group
# and the last date is `2021-02-01`
dates = {'dates': lambda df: pd.date_range(df.min(), '2021-02-01', freq='MS')}

# apply the function, with uniquecode1 and Name as the groupby names
# and do some cleanup to get the final output
# (`downcast` in fillna is deprecated in recent pandas, so cast back to int explicitly)
(df.complete(dates, by=['uniquecode1', 'Name'], sort=True)
   .fillna({'Sale': 0})
   .astype({'Sale': 'int64'})
   .assign(year = lambda df: df.dates.dt.year,
           month = lambda df: df.dates.dt.month)
   .drop(columns='dates')
)

    uniquecode1  year  month Name  Sale
0          1029  2020      5  ABC    10
1          1029  2020      6  ABC    20
2          1029  2020      7  ABC     0
3          1029  2020      8  ABC     0
4          1029  2020      9  ABC     0
5          1029  2020     10  ABC    30
6          1029  2020     11  ABC    35
7          1029  2020     12  ABC    38
8          1029  2021      1  ABC     0
9          1029  2021      2  ABC     0
10         1050  2020      4  DEF    39
11         1050  2020      5  DEF    40
12         1050  2020      6  DEF    31
13         1050  2020      7  DEF    45
14         1050  2020      8  DEF    55
15         1050  2020      9  DEF     0
16         1050  2020     10  DEF     0
17         1050  2020     11  DEF     0
18         1050  2020     12  DEF     0
19         1050  2021      1  DEF     0
20         1050  2021      2  DEF     0
21         1079  2020      4  GHI    65
22         1079  2020      5  GHI     0
23         1079  2020      6  GHI     0
24         1079  2020      7  GHI     0
25         1079  2020      8  GHI     0
26         1079  2020      9  GHI     0
27         1079  2020     10  GHI     0
28         1079  2020     11  GHI     0
29         1079  2020     12  GHI     0
30         1079  2021      1  GHI     0
31         1079  2021      2  GHI    75
32        10810  2021      1  XYZ    85
33        10810  2021      2  XYZ     0

There is no 2020 data for 10810 in your sample dataframe, and as such there is none in the final output above.
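
If rows are wanted for every group from a common start month (for example Apr'20 for 10810 as well, as listed in the question), one pure-pandas possibility is to build the full grid of groups and months and left-merge the data onto it. This is a sketch under assumptions: the start `2020-04-01` and end `2021-02-01` are taken from the question's description, the frame is retyped from the sample, and `how="cross"` needs pandas 1.2 or later.

```python
import pandas as pd

# sample data retyped from the question
df = pd.DataFrame({
    "uniquecode1": [1029] * 5 + [1050] * 5 + [1079] * 2 + [10810],
    "year": [2020] * 11 + [2021, 2021],
    "month": [5, 6, 10, 11, 12, 4, 5, 6, 7, 8, 4, 2, 1],
    "Name": ["ABC"] * 5 + ["DEF"] * 5 + ["GHI"] * 2 + ["XYZ"],
    "Sale": [10, 20, 30, 35, 38, 39, 40, 31, 45, 55, 65, 75, 85],
})
df["dates"] = pd.to_datetime(df[["year", "month"]].assign(day=1))

# assumed common range for all groups: Apr'20 through Feb'21
months = pd.DataFrame({"dates": pd.date_range("2020-04-01", "2021-02-01", freq="MS")})

# one row per (uniquecode1, Name) pair, crossed with every month
pairs = df[["uniquecode1", "Name"]].drop_duplicates()
grid = pairs.merge(months, how="cross")

keys = ["uniquecode1", "Name", "dates"]
out = (grid.merge(df[keys + ["Sale"]], on=keys, how="left")
           .fillna({"Sale": 0})            # months without data get Sale 0
           .astype({"Sale": "int64"})
           .assign(year=lambda d: d["dates"].dt.year,
                   month=lambda d: d["dates"].dt.month)
           .drop(columns="dates"))
print(out[out["uniquecode1"] == 10810])
```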
