Inserting missing numbers in dataframe

Question

I have a program that is supposed to measure the temperature every second, but in practice it does not. Sometimes it skips a second, and sometimes it stops for 400 seconds before it starts recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the number of seconds in a day). I want to apply a moving/rolling average to get a nicer plot, but if I do that on the "raw" data files, the number of data points decreases. This shows up in my plot: watch the x-axis. I know the "nice data" doesn't look nice yet; I'm just playing with some values.

So I want to implement a data-cleaning step that adds the missing rows to the dataframe. I have an idea of what it should do, but I don't know how to implement it:

If the index is not equal to the time, a value needs to be added at time = index. If the gap is only one value, the average of the previous and the next value will do. If the gap is bigger, say 100 seconds are missing, it should be filled by a linear function that steadily increases or decreases the value.

So I guess a sample data set could look like this:

index   time   temp 
0       0      20.10
1       1      20.20
2       2      20.20
3       4      20.10
4       100    22.30

Here, I would like to get a value for index 3, time 3 and the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.

How would I go about programming this?

Rocky Li · Accepted Answer · 2018-11-01 16:46:35Z

Use merge with a complete time column, then interpolate:

import numpy as np
import pandas as pd

# Create a sample table: each of the 20 time steps is kept with probability 0.4
time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']

>>> temps

   time  temperature
0   4.0    21.662352
1  10.0    20.904659
2  15.0    20.345858
3  18.0    24.787389
4  19.0    20.719487

The above is a random table generated with missing time data.

# Build a complete time column from the first to the last recorded time
filled = pd.Series(np.arange(temps.iloc[0, 0], temps.iloc[-1, 0] + 1))
filled = filled.to_frame()
filled.columns = ['time']
# Left-merge with the original; times without a measurement get a null temperature
merged = pd.merge(filled, temps, on='time', how='left')
# Fill the nulls linearly
merged.temperature = merged.temperature.interpolate()

# Alternatively, use reindex; this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()

>>> merged # or final

    time  temperature
0    4.0    21.662352
1    5.0    21.536070
2    6.0    21.409788
3    7.0    21.283505
4    8.0    21.157223
5    9.0    21.030941
6   10.0    20.904659
7   11.0    20.792898
8   12.0    20.681138
9   13.0    20.569378
10  14.0    20.457618
11  15.0    20.345858
12  16.0    21.826368
13  17.0    23.306879
14  18.0    24.787389
15  19.0    20.719487
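As a sketch of how the reindex variant might look on the sample data from the question (times 0, 1, 2, 4 and 100), the block below rebuilds that table by hand and fills the gaps. The variable names `full_range` and `filled` are illustrative, not part of the original answer, and the `rename_axis` call is only there to make sure the index keeps its `time` name before it is turned back into a column.

```python
import numpy as np
import pandas as pd

# Sample data from the question: time 3 and times 5..99 are missing
temps = pd.DataFrame({
    "time": [0, 1, 2, 4, 100],
    "temp": [20.10, 20.20, 20.20, 20.10, 22.30],
})

# Every second from the first to the last recorded time (illustrative name)
full_range = np.arange(temps["time"].min(), temps["time"].max() + 1)

# Reindex on the complete range; seconds without a measurement become NaN
filled = temps.set_index("time").reindex(full_range)

# Linear interpolation: a one-second gap gets the average of its neighbours,
# a longer gap gets a straight line between the surrounding measurements
filled["temp"] = filled["temp"].interpolate()

# Turn the index back into a regular 'time' column
filled = filled.rename_axis("time").reset_index()

print(filled.head(6))
```

With this data, the single missing second at time 3 ends up halfway between its neighbours, and the long gap between time 4 and time 100 is filled with evenly spaced steps.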

Swier · Accepted Answer · 2018-11-01 16:22:37Z

First, convert the seconds values to actual datetime values and use them as the index:

df.index = pd.to_datetime(df['time'], unit='s')

After which you can use pandas' built-in time series operations to resample and fill in the missing values:

df = df.resample('s').interpolate('time')
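To make the two steps above concrete, here is a minimal sketch that applies them to the sample data from the question. The dataframe is rebuilt by hand and the name `resampled` is only illustrative. Note that with this approach the numeric `time` column is interpolated along with the temperature, since both are ordinary columns.

```python
import pandas as pd

# Sample data from the question: time 3 and times 5..99 are missing
df = pd.DataFrame({
    "time": [0, 1, 2, 4, 100],
    "temp": [20.10, 20.20, 20.20, 20.10, 22.30],
})

# Interpret the seconds as datetimes (counted from 1970-01-01) and use them as index
df.index = pd.to_datetime(df["time"], unit="s")

# Upsample to one row per second and fill the new rows by
# interpolating according to the time stamps in the index
resampled = df.resample("s").interpolate("time")

print(resampled.head(6))
```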

Optionally, if you still want some smoothing, you can use:

df.rolling(5, center=True, win_type='hann').mean()

This smooths with a 5-element-wide Hann window (the win_type option requires SciPy). Note: any window-based smoothing will cost you valid data points at the edges.
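The edge effect can be seen in a small sketch. It uses a plain (unweighted) centered rolling mean rather than the Hann window, purely to avoid the SciPy dependency, and made-up values. The `min_periods=1` variant is one possible way to keep the edge rows, at the cost of averaging over fewer points there; it is an illustration, not part of the original answer.

```python
import pandas as pd

# Seven made-up temperature readings, one per second
s = pd.Series([20.0, 20.2, 20.4, 20.6, 20.8, 21.0, 21.2])

# Default behaviour: a full window of 5 is required, so the first two
# and the last two positions have no result and become NaN
strict = s.rolling(5, center=True).mean()

# With min_periods=1 the window shrinks at the edges instead of returning NaN
lenient = s.rolling(5, center=True, min_periods=1).mean()

print(pd.DataFrame({"raw": s, "strict": strict, "lenient": lenient}))
```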

Your dataframe now has datetimes (including the date) as its index, which the resample method requires. If you want to drop the date, you can use:

df.index = df.index.time
