Python add missing rows to dataframe

Question

I have a dataframe that can sometimes have incomplete data. For example, the one below stops at Hour 22 instead of 23:

           Date  Hour  Interval     Source         ID  Number of Messages
0    2020-05-19     0         0          1  413379290                  23
1    2020-05-19     0        15          1  413379290                  36
2    2020-05-19     0        30          1  413379290                  31
3    2020-05-19     0        45          1  413379290                  14
4    2020-05-19     1         0          1  413379290                   3
..          ...   ...       ...        ...        ...                 ...
183  2020-05-20    21        45          1  413379290                   6
184  2020-05-20    22         0          1  413379290                   8
185  2020-05-20    22        15          1  413379290                   4
186  2020-05-20    22        30          1  413379290                   6
187  2020-05-20    22        45          1  413379290                   9

How do I use pandas so that it looks like this?

           Date  Hour  Interval     Source         ID  Number of Messages
0    2020-05-19     0         0          1  413379290                  23
1    2020-05-19     0        15          1  413379290                  36
2    2020-05-19     0        30          1  413379290                  31
3    2020-05-19     0        45          1  413379290                  14
4    2020-05-19     1         0          1  413379290                   3
..          ...   ...       ...        ...        ...                 ...
183  2020-05-20    21        45          1  413379290                   6
184  2020-05-20    22         0          1  413379290                   8
185  2020-05-20    22        15          1  413379290                   4
186  2020-05-20    22        30          1  413379290                   6
187  2020-05-20    22        45          1  413379290                   9
188  2020-05-20    23         0          1  413379290                 NaN
189  2020-05-20    23        15          1  413379290                 NaN
190  2020-05-20    23        30          1  413379290                 NaN
191  2020-05-20    23        45          1  413379290                 NaN

Kindly present a small sample dataframe, without the ..., so that an answer can be replicated easily. You can then scale the answer to your project or larger data.
– sammywemmy, Commented Mar 17, 2021 at 20:01
The ... is used to scale down the dataframe. It is a simple dataframe that contains rows for hours 0-23. Sometimes one or more hours are missing, and I presented a simple case where one hour is missing. I would like to fill in the rows for the missing hours.
– cslurker31, Commented Mar 17, 2021 at 20:06

Ben.T · Accepted Answer · 2021-03-17 21:00:33Z

2

You can use reindex. To create the missing hours, build a MultiIndex.from_frame from every column except Number of Messages, then use MultiIndex.from_product to recreate all combinations of the existing values, replacing the values in Hour with range(24). Finally, call set_index on those columns and reindex the dataframe with the full index.

import pandas as pd

# every column except the one that should become NaN
cols = ['Date', 'Hour', 'Interval', 'Source', 'ID']
# build a MultiIndex holding every combination of the existing values,
# with Hour replaced by the full range 0 to 23
new_idx = pd.MultiIndex.from_product(
    [range(24) if col == 'Hour' else lv
     for col, lv in zip(cols, pd.MultiIndex.from_frame(df[cols]).levels)],
    names=cols,
)
# reindex the original df (assumed to exist already); reassign to df if you prefer
new_df = (
    df.set_index(cols)
      .reindex(new_idx)
      .reset_index()
)
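
To try the reindex approach without the full dataset, here is a minimal sketch on a made-up four-row sample (one date, only hour 22 present). The sample values are assumptions chosen for illustration; the technique is the same MultiIndex.from_product plus reindex described above.

```python
import pandas as pd

# Made-up sample: a single date where only hour 22 is present.
df = pd.DataFrame({
    "Date": ["2020-05-20"] * 4,
    "Hour": [22] * 4,
    "Interval": [0, 15, 30, 45],
    "Source": [1] * 4,
    "ID": [413379290] * 4,
    "Number of Messages": [8, 4, 6, 9],
})

cols = ["Date", "Hour", "Interval", "Source", "ID"]
levels = pd.MultiIndex.from_frame(df[cols]).levels

# Full index: existing values for every column, except Hour which becomes 0..23.
new_idx = pd.MultiIndex.from_product(
    [range(24) if col == "Hour" else lv for col, lv in zip(cols, levels)],
    names=cols,
)

# Rows that were absent from df appear with NaN in Number of Messages.
new_df = df.set_index(cols).reindex(new_idx).reset_index()

print(new_df.shape)
print(new_df[new_df["Hour"] == 23])
```

Because from_product builds the full Cartesian product, a dataframe with several distinct Source or ID values would get a row for every combination of them, which may or may not be what you want.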

answered Mar 17, 2021 at 21:00


2 Comments

Joe Ferndz Over a year ago

Thanks @Ben.T for this solution. I didn't think of the MultiIndex approach. Learnt something new today

Ben.T Over a year ago

@JoeFerndz yeah, it is a good trick to use MultiIndex.from_product to insert missing values when the pattern is regular like here :) Your idea of using the 15 min interval is a bit out of the box but pretty smart in this case, and shows how everyone sees problems differently :)

Joe Ferndz · Accepted Answer · 2021-03-18 00:06:39Z

The approach I would take is to find the min and max of the date, then create a range of timestamps at 15-minute intervals. Use df.merge to add all values from df to the newly created dataframe.

Note that the sample data here starts at 2020-05-19 01:00:00 and not 00:00:00, so the final output also starts at 01:00:00.

import pandas as pd
c = ['Date','Hour','Interval','Source','ID','Number of Messages']
d = [
['2020-05-19',     1,         0,          1,  413379290,                  23],
['2020-05-19',     1,        15,          1,  413379290,                  36],
['2020-05-19',     1,        30,          1,  413379290,                  31],
['2020-05-19',     1,        45,          1,  413379290,                  14],
['2020-05-19',     2,         0,          1,  413379290,                   3],
['2020-05-20',    21,        45,          1,  413379290,                   6],
['2020-05-20',    22,         0,          1,  413379290,                   8],
['2020-05-20',    22,        15,          1,  413379290,                   4],
['2020-05-20',    22,        30,          1,  413379290,                   6],
['2020-05-20',    22,        45,          1,  413379290,                   9]]

df = pd.DataFrame(d,columns=c)
df['Date'] = pd.to_datetime(df['Date'])

print (df)

#first get the start and end period by adding Hour and Interval to Date
df['DateFull'] = df.Date + pd.to_timedelta(df.Hour,unit='h') + pd.to_timedelta(df.Interval,unit='m')

#Create a range of dates with 15 mins interval from Start Date (including Hour & Min) to Last Day + 23:45
#('15min' is used because the '15T' alias is deprecated in recent pandas versions)
df1 = pd.DataFrame({'DateFull':pd.date_range(df.DateFull.min(), df.DateFull.max().floor('d') + pd.to_timedelta('23:45:00'), freq='15min')})

#Create columns with Hour and Interval based on new Date Range
df1['Hour'] = df1.DateFull.dt.hour
df1['Interval'] = df1.DateFull.dt.minute

#Merge on DateFull, Hour, Interval to get the full set merged with original DF
df1 = df1.merge(df, how='left', on=['DateFull','Hour','Interval'])

#Derive Date from the full timestamp (forward filling it would copy the previous day's date into gap rows), then forward fill Source and ID
df1['Date'] = df1.DateFull.dt.normalize()
df1[['Source','ID']] = df1[['Source','ID']].ffill()

#convert Source and ID to int
df1[['Source','ID']] = df1[['Source','ID']].astype(int)

#Drop DateFull as it is no longer needed
df1.drop(columns ='DateFull',inplace=True)

#Restore the original column order
df1 = df1.reindex(c, axis=1)

print (df1)

Original DataFrame:

        Date  Hour  Interval  Source         ID  Number of Messages
0 2020-05-19     1         0       1  413379290                  23
1 2020-05-19     1        15       1  413379290                  36
2 2020-05-19     1        30       1  413379290                  31
3 2020-05-19     1        45       1  413379290                  14
4 2020-05-19     2         0       1  413379290                   3
5 2020-05-20    21        45       1  413379290                   6
6 2020-05-20    22         0       1  413379290                   8
7 2020-05-20    22        15       1  413379290                   4
8 2020-05-20    22        30       1  413379290                   6
9 2020-05-20    22        45       1  413379290                   9

Final DataFrame:

          Date  Hour  Interval  Source         ID  Number of Messages
0   2020-05-19     1         0       1  413379290                23.0
1   2020-05-19     1        15       1  413379290                36.0
2   2020-05-19     1        30       1  413379290                31.0
3   2020-05-19     1        45       1  413379290                14.0
4   2020-05-19     2         0       1  413379290                 3.0
..         ...   ...       ...     ...        ...                 ...
183 2020-05-20    22        45       1  413379290                 9.0
184 2020-05-20    23         0       1  413379290                 NaN
185 2020-05-20    23        15       1  413379290                 NaN
186 2020-05-20    23        30       1  413379290                 NaN
187 2020-05-20    23        45       1  413379290                 NaN

df1.tail(20) gives you this:

          Date  Hour  Interval  Source         ID  Number of Messages
168 2020-05-20    19         0       1  413379290                 NaN
169 2020-05-20    19        15       1  413379290                 NaN
170 2020-05-20    19        30       1  413379290                 NaN
171 2020-05-20    19        45       1  413379290                 NaN
172 2020-05-20    20         0       1  413379290                 NaN
173 2020-05-20    20        15       1  413379290                 NaN
174 2020-05-20    20        30       1  413379290                 NaN
175 2020-05-20    20        45       1  413379290                 NaN
176 2020-05-20    21         0       1  413379290                 NaN
177 2020-05-20    21        15       1  413379290                 NaN
178 2020-05-20    21        30       1  413379290                 NaN
179 2020-05-20    21        45       1  413379290                 6.0
180 2020-05-20    22         0       1  413379290                 8.0
181 2020-05-20    22        15       1  413379290                 4.0
182 2020-05-20    22        30       1  413379290                 6.0
183 2020-05-20    22        45       1  413379290                 9.0
184 2020-05-20    23         0       1  413379290                 NaN
185 2020-05-20    23        15       1  413379290                 NaN
186 2020-05-20    23        30       1  413379290                 NaN
187 2020-05-20    23        45       1  413379290                 NaN
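
The same timestamp-grid idea can also be sketched with reindex instead of merge. This is only an illustrative variant, not the code above: the three-row sample is made up, and Date, Hour and Interval are all derived from the timestamp index so that filled rows always carry the date they actually belong to.

```python
import pandas as pd

# Made-up sample spanning two days with gaps.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-05-19", "2020-05-19", "2020-05-20"]),
    "Hour": [23, 23, 22],
    "Interval": [0, 30, 45],
    "Source": [1, 1, 1],
    "ID": [413379290] * 3,
    "Number of Messages": [5, 7, 9],
})

# Full timestamp of each row: Date + Hour + Interval.
ts = (
    df["Date"]
    + pd.to_timedelta(df["Hour"], unit="h")
    + pd.to_timedelta(df["Interval"], unit="m")
)

# 15-minute grid from the first timestamp to 23:45 on the last day.
grid = pd.date_range(
    ts.min(),
    ts.max().normalize() + pd.Timedelta(hours=23, minutes=45),
    freq="15min",
)

# Reindex on the grid; rows missing from df get NaN everywhere.
out = df.set_index(pd.DatetimeIndex(ts)).reindex(grid)

# Rebuild the key columns from the grid itself rather than forward filling them.
out["Date"] = out.index.normalize()
out["Hour"] = out.index.hour
out["Interval"] = out.index.minute

# Source and ID are assumed constant, so a forward fill is safe for them.
out[["Source", "ID"]] = out[["Source", "ID"]].ffill().astype(int)
out = out.reset_index(drop=True)

print(out.tail())
```

Deriving the key columns from the grid means a gap that crosses midnight cannot end up labelled with the previous day's date.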

norie · Accepted Answer · 2021-03-17 21:17:35Z

1

You can do this by creating a new dataframe with all the appropriate values for the first five columns and then merging it with the original dataframe to get the values from the Number of Messages column in the appropriate rows.

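import pandas as pd

df = pd.read_csv('test.csv')

# every date present in the data
dates = df['Date'].unique()

# each hour 0-23 repeated four times (once per interval), for every date
hrs = [hr for hr in range(24) for i in range(4)]*len(dates)

# the four 15-minute intervals, repeated for every hour of every date
intervals = [0, 15, 30, 45] * 24 *len(dates)

# skeleton dataframe with one row per date, hour and interval
new_df = pd.DataFrame()
new_df['Date'] = [dt for dt in dates for i in range(24*4)]

new_df['Hour'] = hrs
new_df['Interval'] = intervals
# Source and ID are taken from the first row, i.e. assumed constant (ID is stored as a string)
new_df['Source'] = df['Source'].iloc[0]
new_df['ID'] = str(df['ID'].iloc[0])

# left merge pulls Number of Messages into the matching rows; the duplicate
# Source/ID columns coming from df get the _y suffix and are dropped
new_df = new_df.merge(df, how='left', on=['Date', 'Hour', 'Interval']).drop(['Source_y', 'ID_y'], axis=1)
new_df.rename(columns={'Source_x':'Source', 'ID_x':'ID'},inplace=True)

new_df.to_excel('testit.xlsx')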
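
For readers without test.csv, the merge approach can be tried on in-memory data. This is a hedged sketch, not the code above: the sample rows are made up, the skeleton is built with MultiIndex.from_product(...).to_frame() as a shortcut for the list comprehensions, and a single constant Source and ID is assumed.

```python
import pandas as pd

# Made-up sample with most rows missing.
df = pd.DataFrame({
    "Date": ["2020-05-19", "2020-05-19", "2020-05-20"],
    "Hour": [0, 0, 22],
    "Interval": [0, 15, 45],
    "Source": [1, 1, 1],
    "ID": [413379290] * 3,
    "Number of Messages": [23, 36, 9],
})

keys = ["Date", "Hour", "Interval"]

# Skeleton: one row per (date, hour, interval) combination.
skeleton = pd.MultiIndex.from_product(
    [df["Date"].unique(), range(24), [0, 15, 30, 45]],
    names=keys,
).to_frame(index=False)

# Assumption: Source and ID are the same for every row.
skeleton["Source"] = df["Source"].iloc[0]
skeleton["ID"] = df["ID"].iloc[0]

# Merging only the keys plus the value column avoids _x/_y suffixes.
new_df = skeleton.merge(df[keys + ["Number of Messages"]], how="left", on=keys)

print(new_df.shape)
print(new_df.tail())
```

Restricting the right-hand side of the merge to the key columns and Number of Messages is a design choice that removes the need for the drop and rename steps.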

edited Mar 17, 2021 at 21:17

answered Mar 17, 2021 at 19:35


7 Comments

cslurker31 Over a year ago

I tried this and I only got 4 entire rows of NaN. Is there any way to keep the date, interval, source, and id consistent with the rest of the data frame and append 23 in the hour column as seen in my post?

norie Over a year ago

I've updated my answer and what I posted does work but there's probably a better way to do it.

mosc9575 Over a year ago

You can add a forward fill: df = df.ffill(). This would fill all columns with the last value. In your case you may want to do this only on a few columns.

norie Over a year ago

@nan I've updated the code so it should handle missing rows anywhere in the original dataframe. It could probably be improved though, for example what if there's more than 2 days of data?

norie Over a year ago

I've updated the code again to handle an arbitrary number of dates.
