Imputing missing Dates in Pandas Dataframe

Question

df = pd.read_csv(
    'https://media-doselect.s3.amazonaws.com/generic/MJjpYqLzv08xAkjqLp1ga1Aq/Historical_Data.csv')
df.head()

    Date        Article_ID   Country_Code   Sold_Units
0   20170817        1132       AT               1
1   20170818        1132       AT               1
2   20170821        1132       AT               1
3   20170822        1132       AT               1
4   20170906        1132       AT               1

I have the DataFrame shown above. Note that the Date column is of type int64 and that some dates, such as the 19th and 20th, are missing.

I want to convert the dates to the format yyyy-mm-dd and add rows for the missing dates, with 0 in Article_ID, Country_Code and Sold_Units.

So far I have tried:

df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m%d')

to get the dates in the required format.

    Date         Article_ID  Country_Code  Sold_Units
0   2017-08-17      1132       AT               1
1   2017-08-18      1132       AT               1
2   2017-08-21      1132       AT               1
3   2017-08-22      1132       AT               1
4   2017-09-06      1132       AT               1

However, how do I add rows for the missing dates (the 19th and 20th) and fill the other columns in those new rows with 0s?

Here is a snippet of what I have tried, which raises ValueError: cannot reindex from a duplicate axis.

ansev · Accepted Answer · 2019-10-20 14:57:45Z

3

You can use DataFrame.asfreq to add the missing dates. Because asfreq cannot reindex an index that contains duplicate dates, drop the duplicate rows first, reindex, then add the duplicates back and sort:

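# The integers are yyyymmdd, so parse them with '%Y%m%d'
df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m%d')
# Set aside rows whose date repeats; asfreq needs a unique index
df2 = df[df.duplicated('Date')].set_index('Date')
# Add one row per missing day, filled with 0
new_df = df.drop_duplicates('Date').set_index('Date').asfreq('D', fill_value=0)
# Put the duplicates back (DataFrame.append was removed in pandas 2.0)
new_df = pd.concat([new_df, df2]).sort_index().reset_index()
print(new_df)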

         Date  Article_ID Country_Code  Sold_Units
0  2017-08-17        1132           AT           1
1  2017-08-17        1132           AT           1
2  2017-08-18        1132           AT           1
3  2017-08-19           0            0           0
4  2017-08-20           0            0           0
5  2017-08-21        1132           AT           1
6  2017-08-22        1132           AT           1
7  2017-08-23           0            0           0
8  2017-08-24           0            0           0
9  2017-08-25           0            0           0
10 2017-08-26           0            0           0
11 2017-08-27           0            0           0
12 2017-08-28           0            0           0
13 2017-08-29           0            0           0
14 2017-08-30           0            0           0
15 2017-08-31           0            0           0
16 2017-09-01           0            0           0
17 2017-09-02           0            0           0
18 2017-09-03           0            0           0
19 2017-09-04           0            0           0
20 2017-09-05           0            0           0
21 2017-09-06        1132           AT           1
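
The steps above can be tried without downloading the CSV. The following is a minimal sketch, not the answerer's exact code: the sample rows are hypothetical, and the duplicated 2017-08-17 row is an assumption taken from the output shown above. Country_Code is created with object dtype so that it can hold the integer fill value 0 next to strings.

```python
import pandas as pd

# Hypothetical sample mirroring the question's columns; the repeated
# 2017-08-17 row is an assumption based on the answer's printed output.
df = pd.DataFrame({
    "Date": [20170817, 20170817, 20170818, 20170821, 20170822],
    "Article_ID": [1132] * 5,
    # object dtype so the column can hold 0 alongside strings
    "Country_Code": pd.Series(["AT"] * 5, dtype=object),
    "Sold_Units": [1] * 5,
})

# The integers are yyyymmdd, so the matching format string is '%Y%m%d'.
df["Date"] = pd.to_datetime(df["Date"].astype(str), format="%Y%m%d")

# Set aside rows whose date repeats, because asfreq needs a unique index.
dupes = df[df.duplicated("Date")].set_index("Date")

# Add one row per missing calendar day, filled with 0 in every column.
filled = df.drop_duplicates("Date").set_index("Date").asfreq("D", fill_value=0)

# Put the repeated rows back, then restore Date as a regular column.
new_df = pd.concat([filled, dupes]).sort_index().reset_index()
print(new_df)
```

With this sample, the 19th and 20th appear as new rows with 0 in every other column, and both 2017-08-17 rows are kept.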

edited Oct 20, 2019 at 14:57

answered Oct 20, 2019 at 13:39


10 Comments

therion Over a year ago

I get this error while trying your code. ValueError: cannot reindex from a duplicate axis

therion Over a year ago

df = pd.read_csv( 'media-doselect.s3.amazonaws.com/generic/…) This is the dataset.

therion Over a year ago

Using groupby(level=0) only adds a level_0 column before the Date column.

therion Over a year ago

I have added my output image which is returning the value error.

ansev Over a year ago

I have updated the code. The strategy can be to reindex without duplicate rows and add these later. Please check this attempt :)
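
The error discussed in these comments can be reproduced on a small frame. This is a minimal sketch with hypothetical data, assuming the real dataset contains at least one repeated date; the exact wording of the error message differs between pandas versions, so the code only checks that a ValueError is raised.

```python
import pandas as pd

# Hypothetical frame in which 2017-08-17 appears twice in the index.
idx = pd.to_datetime(["2017-08-17", "2017-08-17", "2017-08-18", "2017-08-21"])
df = pd.DataFrame({"Sold_Units": [1, 1, 1, 1]}, index=idx)

# asfreq reindexes to a daily range, which fails on a non-unique index.
try:
    df.asfreq("D", fill_value=0)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)

# Dropping the repeated labels first lets the reindex go through.
deduped = df[~df.index.duplicated(keep="first")]
daily = deduped.asfreq("D", fill_value=0)
print(daily)
```

Dropping duplicates loses the repeated rows, which is why the answer sets them aside first and adds them back after the reindex.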


powerPixie · Accepted Answer · 2019-10-20 13:45:23Z

0

You can use:

df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m%d', errors='coerce')

The row with the missing date is not dropped; its date is represented by NaT.

You get something like this:

       Date  Article_ID Country_Code  Sold_Units
 0 2017-08-17        1132          AT           1
 1 2017-08-18        1132          AT           1
 2        NaT        1132          AT           1
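
A minimal sketch of what errors='coerce' does, using hypothetical raw values rather than the question's data: any entry that does not match the format becomes NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values: two valid yyyymmdd strings and one bad entry.
raw = pd.Series(["20170817", "20170818", "not a date"])

# Unparseable entries are converted to NaT rather than raising.
parsed = pd.to_datetime(raw, format="%Y%m%d", errors="coerce")
print(parsed)
```

This handles invalid or empty values in the Date column. It does not add rows for calendar days that are absent from the data.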

answered Oct 20, 2019 at 13:45


6 Comments

therion Over a year ago

I want the imputed date to be appended to the dataframe.

powerPixie Over a year ago

What date? If you don't have the value, because it's missing, you are faking one. You can suppose one, because your data seems ordered, but you can't be 100% sure.

therion Over a year ago

In the DataFrame, the 19th and 20th should be added because they are missing dates, and the values for the article code and sold units should be 0.

powerPixie Over a year ago

I didn't use the complete DataFrame; I used just your first 3 records and left the date null on the 3rd one. What errors='coerce' does is set any value that cannot be parsed to NaT instead of raising an error.

powerPixie Over a year ago

Now I see that you were talking about gaps between dates. I've assumed it was NaN values in your date column. My bad, sorry. But I see someone already provided an effective answer.
