3

I have a dataframe that can sometimes have incomplete data. For example, the one below stops at Hour 22 instead of 23:

           Date  Hour  Interval     Source         ID  Number of Messages
0    2020-05-19     0         0          1  413379290                  23
1    2020-05-19     0        15          1  413379290                  36
2    2020-05-19     0        30          1  413379290                  31
3    2020-05-19     0        45          1  413379290                  14
4    2020-05-19     1         0          1  413379290                   3
..          ...   ...       ...        ...        ...                 ...
183  2020-05-20    21        45          1  413379290                   6
184  2020-05-20    22         0          1  413379290                   8
185  2020-05-20    22        15          1  413379290                   4
186  2020-05-20    22        30          1  413379290                   6
187  2020-05-20    22        45          1  413379290                   9

How do I use pandas so that it looks like this?

           Date  Hour  Interval     Source         ID  Number of Messages
0    2020-05-19     0         0          1  413379290                  23
1    2020-05-19     0        15          1  413379290                  36
2    2020-05-19     0        30          1  413379290                  31
3    2020-05-19     0        45          1  413379290                  14
4    2020-05-19     1         0          1  413379290                   3
..          ...   ...       ...        ...        ...                 ...
183  2020-05-20    21        45          1  413379290                   6
184  2020-05-20    22         0          1  413379290                   8
185  2020-05-20    22        15          1  413379290                   4
186  2020-05-20    22        30          1  413379290                   6
187  2020-05-20    22        45          1  413379290                   9
188  2020-05-20    23         0          1  413379290                 NaN
189  2020-05-20    23        15          1  413379290                 NaN
190  2020-05-20    23        30          1  413379290                 NaN
191  2020-05-20    23        45          1  413379290                 NaN

4
  • Kindly present a small sample dataframe, without the ..., so that an answer can be replicated easily. You can then scale the answer to your project or larger data. Commented Mar 17, 2021 at 20:01
  • The ... is used to scale down the dataframe. It is a simple dataframe which contains rows for hours 0-23. Sometimes one or more hours are missing, and I presented a simple case where 1 hour is missing. I would like to fill in the rows for the missing hours. Commented Mar 17, 2021 at 20:06
  • Are Source and ID the same throughout the dataframe? Commented Mar 17, 2021 at 20:09
  • Yes, the Source and ID are the same in the dataframe. Commented Mar 17, 2021 at 20:14

3 Answers

2

You can use reindex. To create the missing hours, take the existing values in all the columns except Number of Messages and build a MultiIndex with MultiIndex.from_frame. Then use MultiIndex.from_product on its levels to generate every combination, replacing the existing values in Hour with range(24). Finally, call set_index on those columns and reindex with the full index.

import pandas as pd

# all columns except the one that should be NaN in the new rows
cols = ['Date', 'Hour', 'Interval', 'Source', 'ID']
# create the MultiIndex with all combinations of values
new_idx = (
    pd.MultiIndex.from_product(
        [lv if col != 'Hour' else range(24)  # replace existing Hour values with 0 to 23
         for col, lv in zip(cols, pd.MultiIndex.from_frame(df[cols]).levels)
         ], names=cols)
)
# reindex the original df; you can reassign to the same df if you want
new_df = (
    df.set_index(cols)
      .reindex(new_idx)
      .reset_index()
)
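
For anyone who wants to try the reindex approach without the original data, here is a minimal self-contained sketch. The sample rows are made up for illustration (two dates, one complete hour each), and it assumes every value of Interval appears at least once in the data, since from_product only sees the values that exist in each level.

```python
import pandas as pd

# Hypothetical sample: hour 0 on the first date, hour 22 on the second
all_cols = ["Date", "Hour", "Interval", "Source", "ID", "Number of Messages"]
rows = [("2020-05-19", 0, m, 1, 413379290, 10 + m) for m in (0, 15, 30, 45)]
rows += [("2020-05-20", 22, m, 1, 413379290, 5 + m) for m in (0, 15, 30, 45)]
df = pd.DataFrame(rows, columns=all_cols)

# Key columns: everything except the column that should become NaN
cols = ["Date", "Hour", "Interval", "Source", "ID"]
levels = pd.MultiIndex.from_frame(df[cols]).levels

# Full grid: existing values for each key, but hours 0..23 for Hour
new_idx = pd.MultiIndex.from_product(
    [range(24) if col == "Hour" else lv for col, lv in zip(cols, levels)],
    names=cols,
)

new_df = df.set_index(cols).reindex(new_idx).reset_index()

print(new_df.shape)    # 2 dates * 24 hours * 4 intervals rows
print(new_df.tail(4))  # hour 23 of the last date, filled with NaN
```

Note that Number of Messages becomes a float column once NaN values are introduced.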

2 Comments

Thanks @Ben.T for this solution. I didn't think of the MultiIndex approach. Learnt something new today
@JoeFerndz Yeah, it is a good trick to use MultiIndex.from_product to insert missing values when the pattern is regular like here :) Your idea of using the 15-minute interval is a bit out of the box but pretty smart in this case, and shows how everyone sees problems differently :)
2

The approach I would take is to find the min and max of the date, then create a range of timestamps at 15-minute intervals. Then use merge to bring all the values from df into the newly created dataframe.

Note that in this example the data starts at 2020-05-19 01:00:00 and not 00:00:00, so the final output will also start at 01:00:00 and not 00:00:00.

import pandas as pd
c = ['Date','Hour','Interval','Source','ID','Number of Messages']
d = [
['2020-05-19',     1,         0,          1,  413379290,                  23],
['2020-05-19',     1,        15,          1,  413379290,                  36],
['2020-05-19',     1,        30,          1,  413379290,                  31],
['2020-05-19',     1,        45,          1,  413379290,                  14],
['2020-05-19',     2,         0,          1,  413379290,                   3],
['2020-05-20',    21,        45,          1,  413379290,                   6],
['2020-05-20',    22,         0,          1,  413379290,                   8],
['2020-05-20',    22,        15,          1,  413379290,                   4],
['2020-05-20',    22,        30,          1,  413379290,                   6],
['2020-05-20',    22,        45,          1,  413379290,                   9]]

df = pd.DataFrame(d,columns=c)
df['Date'] = pd.to_datetime(df['Date'])

print (df)

#first get the start and end period by adding Hour and Interval to Date
df['DateFull'] = df.Date + pd.to_timedelta(df.Hour,unit='h') + pd.to_timedelta(df.Interval,unit='m')

#Create a range of timestamps at 15-minute intervals from the start (including Hour & Interval) to the last day + 23:45
#('15min' is used instead of '15T' because the 'T' alias is deprecated in recent pandas versions)
df1 = pd.DataFrame({'DateFull':pd.date_range(df.DateFull.min(), df.DateFull.max().floor('D') + pd.to_timedelta('23:45:00'), freq='15min')})

#Create columns with Hour and Interval based on new Date Range
df1['Hour'] = df1.DateFull.dt.hour
df1['Interval'] = df1.DateFull.dt.minute

#Merge on DateFull, Hour, Interval to get the full set merged with original DF
df1 = df1.merge(df, how='left', on=['DateFull','Hour','Interval'])

#forward fill Date, Source and ID 
df1[['Date','Source','ID']] = df1[['Date','Source','ID']].ffill()

#convert Source and ID to int
df1[['Source','ID']] = df1[['Source','ID']].astype(int)

#Drop DateFull as it is no longer needed
df1.drop(columns ='DateFull',inplace=True)

#Reorder the columns to match the original column order
df1 = df1.reindex(c, axis=1)

print (df1)

Original DataFrame:

        Date  Hour  Interval  Source         ID  Number of Messages
0 2020-05-19     1         0       1  413379290                  23
1 2020-05-19     1        15       1  413379290                  36
2 2020-05-19     1        30       1  413379290                  31
3 2020-05-19     1        45       1  413379290                  14
4 2020-05-19     2         0       1  413379290                   3
5 2020-05-20    21        45       1  413379290                   6
6 2020-05-20    22         0       1  413379290                   8
7 2020-05-20    22        15       1  413379290                   4
8 2020-05-20    22        30       1  413379290                   6
9 2020-05-20    22        45       1  413379290                   9

Final DataFrame:

          Date  Hour  Interval  Source         ID  Number of Messages
0   2020-05-19     1         0       1  413379290                23.0
1   2020-05-19     1        15       1  413379290                36.0
2   2020-05-19     1        30       1  413379290                31.0
3   2020-05-19     1        45       1  413379290                14.0
4   2020-05-19     2         0       1  413379290                 3.0
..         ...   ...       ...     ...        ...                 ...
183 2020-05-20    22        45       1  413379290                 9.0
184 2020-05-20    23         0       1  413379290                 NaN
185 2020-05-20    23        15       1  413379290                 NaN
186 2020-05-20    23        30       1  413379290                 NaN
187 2020-05-20    23        45       1  413379290                 NaN

df1.tail(20) gives you this:

          Date  Hour  Interval  Source         ID  Number of Messages
168 2020-05-19    19         0       1  413379290                 NaN
169 2020-05-19    19        15       1  413379290                 NaN
170 2020-05-19    19        30       1  413379290                 NaN
171 2020-05-19    19        45       1  413379290                 NaN
172 2020-05-19    20         0       1  413379290                 NaN
173 2020-05-19    20        15       1  413379290                 NaN
174 2020-05-19    20        30       1  413379290                 NaN
175 2020-05-19    20        45       1  413379290                 NaN
176 2020-05-19    21         0       1  413379290                 NaN
177 2020-05-19    21        15       1  413379290                 NaN
178 2020-05-19    21        30       1  413379290                 NaN
179 2020-05-20    21        45       1  413379290                 6.0
180 2020-05-20    22         0       1  413379290                 8.0
181 2020-05-20    22        15       1  413379290                 4.0
182 2020-05-20    22        30       1  413379290                 6.0
183 2020-05-20    22        45       1  413379290                 9.0
184 2020-05-20    23         0       1  413379290                 NaN
185 2020-05-20    23        15       1  413379290                 NaN
186 2020-05-20    23        30       1  413379290                 NaN
187 2020-05-20    23        45       1  413379290                 NaN
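
One thing to watch in the tail output above: rows 168 to 178 are labelled 2020-05-19 although hours 19 to 21 at that position fall on 2020-05-20, because Date was forward-filled from the last known row. A possible variation, sketched below under the assumption that Source and ID are constant, derives Date from the generated timestamp instead of forward-filling it. The two sample rows and the names full and out are only for illustration.

```python
import pandas as pd

cols = ["Date", "Hour", "Interval", "Source", "ID", "Number of Messages"]
# Hypothetical sample: only the first and last rows of the period
data = [
    ("2020-05-19", 1, 0, 1, 413379290, 23),
    ("2020-05-20", 22, 45, 1, 413379290, 9),
]
df = pd.DataFrame(data, columns=cols)
df["Date"] = pd.to_datetime(df["Date"])
df["DateFull"] = (
    df["Date"]
    + pd.to_timedelta(df["Hour"], unit="h")
    + pd.to_timedelta(df["Interval"], unit="m")
)

# Every 15-minute slot from the first timestamp to 23:45 on the last day
end = df["DateFull"].max().floor("D") + pd.Timedelta("23:45:00")
full = pd.DataFrame(
    {"DateFull": pd.date_range(df["DateFull"].min(), end, freq="15min")}
)

# Derive Date, Hour and Interval from the timestamp itself
full["Date"] = full["DateFull"].dt.normalize()
full["Hour"] = full["DateFull"].dt.hour
full["Interval"] = full["DateFull"].dt.minute

# Bring in the remaining columns; only Source and ID are forward-filled
out = full.merge(
    df[["DateFull", "Source", "ID", "Number of Messages"]],
    how="left",
    on="DateFull",
)
out[["Source", "ID"]] = out[["Source", "ID"]].ffill().astype(int)
out = out[cols]

print(out.iloc[166:170])
```

With this variation the Date column changes at midnight regardless of which rows were present in the original data.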

1

You can do this by creating a new dataframe with all the appropriate values for the first five columns and then merging it with the original dataframe to pull the values from the Number of Messages column into the matching rows.

import pandas as pd

df = pd.read_csv('test.csv')

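# unique dates present in the data
dates = df['Date'].unique()

# each hour 0-23 repeated four times (once per interval), for every date
hrs = [hr for hr in range(24) for i in range(4)]*len(dates)

# the four 15-minute intervals repeated for every hour of every date
intervals = [0, 15, 30, 45] * 24 *len(dates)

new_df = pd.DataFrame()
# each date repeated for its 96 quarter-hour slots
new_df['Date'] = [dt for dt in dates for i in range(24*4)]

new_df['Hour'] = hrs
new_df['Interval'] = intervals
# Source and ID are the same throughout, so take them from the first row
new_df['Source'] = df['Source'].iloc[0]
new_df['ID'] = str(df['ID'].iloc[0])

# the left merge keeps every slot; drop the duplicate Source/ID columns coming from df
new_df = new_df.merge(df, how='left', on=['Date', 'Hour', 'Interval']).drop(['Source_y', 'ID_y'], axis=1)
new_df.rename(columns={'Source_x':'Source', 'ID_x':'ID'},inplace=True)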

new_df.to_excel('testit.xlsx')
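
A possible simplification, sketched here rather than taken from the answer: if Source and ID are left in their original dtype and included in the merge keys, no _x/_y suffixes are created, so the drop and rename steps are not needed. The sample frame below is made up and stands in for test.csv, and grid is an illustrative name. The sketch assumes Source and ID are constant, as confirmed in the question's comments.

```python
import pandas as pd

# Hypothetical stand-in for the CSV contents (dates kept as strings)
all_cols = ["Date", "Hour", "Interval", "Source", "ID", "Number of Messages"]
df = pd.DataFrame(
    [
        ("2020-05-19", 0, 0, 1, 413379290, 23),
        ("2020-05-19", 0, 15, 1, 413379290, 36),
        ("2020-05-20", 22, 45, 1, 413379290, 9),
    ],
    columns=all_cols,
)

# Full grid of date x hour x interval as a plain dataframe
grid = pd.MultiIndex.from_product(
    [df["Date"].unique(), range(24), [0, 15, 30, 45]],
    names=["Date", "Hour", "Interval"],
).to_frame(index=False)

# Source and ID are assumed constant, so copy them from the first row
grid["Source"] = df["Source"].iloc[0]
grid["ID"] = df["ID"].iloc[0]

# Merging on all five key columns avoids the _x/_y suffixes
keys = ["Date", "Hour", "Interval", "Source", "ID"]
new_df = grid.merge(df, how="left", on=keys)

print(new_df.head())
```

Unlike the code above, this keeps ID as an integer instead of converting it to a string.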

7 Comments

I tried this and I only got 4 entire rows of NaN. Is there any way to keep the date, interval, source, and id consistent with the rest of the dataframe and append 23 in the hour column, as seen in my post?
I've updated my answer and what I posted does work, but there's probably a better way to do it.
You can add a forward fill: df = df.ffill(). This would fill all columns with the last value. In your case you may want to do this only on a few columns.
@nan I've updated the code so it should handle missing rows anywhere in the original dataframe. It could probably be improved though; for example, what if there are more than 2 days of data?
I've updated the code again to handle an arbitrary number of dates.