import pandas as pd

df = pd.read_csv(
    'https://media-doselect.s3.amazonaws.com/generic/MJjpYqLzv08xAkjqLp1ga1Aq/Historical_Data.csv')
df.head()

    Date        Article_ID   Country_Code   Sold_Units
0   20170817        1132       AT               1
1   20170818        1132       AT               1
2   20170821        1132       AT               1
3   20170822        1132       AT               1
4   20170906        1132       AT               1

I have the DataFrame above. Note that the Date column is of type int64 and that some dates are missing, for example the 19th and 20th of August 2017.

I want to convert Date to the format yyyy-mm-dd and add rows for the missing dates, with 0 in Article_ID, Country_Code and Sold_Units.

So far I have tried:

df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m%d')

to get the dates in the required format.

    Date         Article_ID  Country_Code  Sold_Units
0   2017-08-17      1132       AT               1
1   2017-08-18      1132       AT               1
2   2017-08-21      1132       AT               1
3   2017-08-22      1132       AT               1
4   2017-09-06      1132       AT               1
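
Because the integers look like 20170817, a format string without separators ('%Y%m%d') matches them on any pandas version. The sketch below is only an illustration: it assumes a hand-built frame named `sample` that mirrors the five rows shown above (not the full CSV) and lists the calendar days that have no row.

```python
import pandas as pd

# Hand-built sample mirroring the rows shown above (illustrative, not the full CSV).
sample = pd.DataFrame({
    "Date": [20170817, 20170818, 20170821, 20170822, 20170906],
    "Article_ID": [1132] * 5,
    "Country_Code": ["AT"] * 5,
    "Sold_Units": [1] * 5,
})

# '%Y%m%d' matches the compact yyyymmdd layout of the integers.
sample["Date"] = pd.to_datetime(sample["Date"].astype(str), format="%Y%m%d")

# Every calendar day between the first and last date, minus the days present.
full_range = pd.date_range(sample["Date"].min(), sample["Date"].max(), freq="D")
missing = full_range.difference(pd.DatetimeIndex(sample["Date"]))
print(missing)
```

Here `missing` holds the dates that would need a zero-filled row, including the 19th and 20th of August.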

However, how do I add rows for the missing dates (such as the 19th and 20th) and fill the other columns of those new rows with 0s?

My attempt so far raises ValueError: cannot reindex from a duplicate axis.
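
This error typically means that the Date values are not unique, so pandas cannot reindex on them. A minimal reproduction, assuming a small hand-built frame `dup` in which one date appears twice (the exact error wording differs between pandas versions):

```python
import pandas as pd

# Two rows share 2017-08-17, which is enough to make the index non-unique.
dup = pd.DataFrame({
    "Date": pd.to_datetime(["2017-08-17", "2017-08-17", "2017-08-18", "2017-08-21"]),
    "Sold_Units": [1, 1, 1, 1],
})

# Check for repeated dates before trying to reindex.
has_duplicates = dup["Date"].duplicated().any()
print(has_duplicates)

try:
    dup.set_index("Date").asfreq("D", fill_value=0)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

Checking `duplicated()` first shows whether the duplicates need to be set aside or aggregated before reindexing.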

2 Answers

DataFrame.asfreq cannot reindex an index that contains duplicate dates. You can set the duplicate rows aside, reindex the unique dates to daily frequency with asfreq, then add the duplicates back and sort:

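df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m%d')
# Set aside the repeated-date rows; asfreq needs a unique index
df2 = df[df.duplicated('Date')].set_index('Date')
# Reindex the unique dates to daily frequency, filling new rows with 0
new_df = df.drop_duplicates('Date').set_index('Date').asfreq('D', fill_value=0)
# Add the duplicates back (DataFrame.append was removed in pandas 2.0)
new_df = pd.concat([new_df, df2]).sort_index().reset_index()
print(new_df)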

         Date  Article_ID Country_Code  Sold_Units
0  2017-08-17        1132           AT           1
1  2017-08-17        1132           AT           1
2  2017-08-18        1132           AT           1
3  2017-08-19           0            0           0
4  2017-08-20           0            0           0
5  2017-08-21        1132           AT           1
6  2017-08-22        1132           AT           1
7  2017-08-23           0            0           0
8  2017-08-24           0            0           0
9  2017-08-25           0            0           0
10 2017-08-26           0            0           0
11 2017-08-27           0            0           0
12 2017-08-28           0            0           0
13 2017-08-29           0            0           0
14 2017-08-30           0            0           0
15 2017-08-31           0            0           0
16 2017-09-01           0            0           0
17 2017-09-02           0            0           0
18 2017-09-03           0            0           0
19 2017-09-04           0            0           0
20 2017-09-05           0            0           0
21 2017-09-06        1132           AT           1
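
The same result can also be reached without splitting off the duplicates, by building the filler rows directly. This is an alternative sketch rather than the method above: the helper `fill_missing_dates` and the frame `demo` are hypothetical names introduced for illustration, and the helper assumes the date column is already datetime.

```python
import pandas as pd

def fill_missing_dates(frame, date_col="Date"):
    """Return frame plus one all-zero row per calendar day absent from date_col."""
    days = pd.date_range(frame[date_col].min(), frame[date_col].max(), freq="D")
    missing = days.difference(pd.DatetimeIndex(frame[date_col]))
    # Filler rows: the missing date, and 0 in every other column.
    filler = pd.DataFrame({date_col: missing})
    for col in frame.columns:
        if col != date_col:
            filler[col] = 0
    combined = pd.concat([frame, filler], ignore_index=True)
    # mergesort is stable, so duplicate dates keep their original relative order.
    return combined.sort_values(date_col, kind="mergesort").reset_index(drop=True)

# Hypothetical demo data with one duplicated date and a two-day gap.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2017-08-17", "2017-08-17", "2017-08-18", "2017-08-21"]),
    "Article_ID": [1132] * 4,
    "Country_Code": ["AT"] * 4,
    "Sold_Units": [1] * 4,
})
filled = fill_missing_dates(demo)
print(filled)
```

As in the output above, the text column receives the integer 0 in the filler rows, so it ends up with mixed types. A placeholder string may be a better fill for that column, depending on the downstream use.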

10 Comments

I get this error when trying your code: ValueError: cannot reindex from a duplicate axis
df = pd.read_csv( 'media-doselect.s3.amazonaws.com/generic/…) This is the dataset.
Using the groupby(level=0) command only adds Level_0 before the Date column.
I have added my output image which is returning the value error.
I have updated the code. The strategy can be to reindex without duplicate rows and add these later. Please check this attempt :)

You can use:

df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m%d', errors='coerce')

A missing or unparseable date is not dropped; it is represented by NaT.

You get something like this:

       Date  Article_ID Outlet_Code  Sold_Units
 0 2017-08-17        1132          AT           1
 1 2017-08-18        1132          AT           1
 2        NaT        1132          AT           1
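
To illustrate what errors='coerce' does, here is a small sketch on hand-made values (the series `raw` and `bad` are illustrative, not taken from the dataset). Values that are missing or cannot be parsed become NaT instead of raising an error. Note that this does not create rows for calendar days that are absent from the data.

```python
import pandas as pd

# The third value is deliberately missing, as in the three-record example above.
raw = pd.Series(["20170817", "20170818", None])
parsed = pd.to_datetime(raw, format="%Y%m%d", errors="coerce")
print(parsed)

# A string that does not match the format is coerced to NaT as well.
bad = pd.to_datetime(pd.Series(["20170817", "not a date"]),
                     format="%Y%m%d", errors="coerce")
print(bad.isna())
```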

6 Comments

I want the imputed date to be appended to the dataframe.
What date? If you don't have the value, because it's missing, you are faking one. You can suppose one, because your data seems ordered, but you can't be 100% sure.
In the dataframe, the 19th and 20th should be added because they are missing dates, and the values for the article ID and sold units on those dates should be 0.
I didn't use the complete dataframe; I used just your first 3 records and left the date null on the 3rd one. What errors='coerce' does is convert any value that cannot be parsed to NaT instead of raising an error.
Now I see that you were talking about gaps between dates. I've assumed it was NaN values in your date column. My bad, sorry. But I see someone already provided an effective answer.