Generating rows in a pandas dataframe to make up for missing values of a column (or multiple columns)

Question

I have the following dataframe.

   hour sensor_id hourly_count 
0     1       101          651
1     1       102           19
2     2       101          423
3     2       102           12
4     3       101          356
5     4       101           79
6     4       102           21
7     5       101          129
8     6       101          561

Notice that sensor_id 102 has no rows for hours 3, 5 and 6. The sensors do not emit a row when the hourly_count is zero, so sensor 102 should be read as having hourly_count = 0 in those hours; the gaps are just an artifact of how the original data was collected.

I would like code that fills in these gaps. With 2 sensors, every hour should have one record per sensor; wherever a record is missing, a row should be inserted for that sensor and hour with hourly_count set to 0. The desired result is:

   hour sensor_id hourly_count 
0     1       101          651
1     1       102           19
2     2       101          423
3     2       102           12
4     3       101          356
5     3       102            0
6     4       101           79
7     4       102           21
8     5       101          129
9     5       102            0
10    6       101          561
11    6       102            0

Any help is really appreciated.

busybear · Accepted Answer · 2019-08-12 04:47:22Z

Using DataFrame.reindex, you can explicitly define your index. This is useful if you are missing data from both sensors for a particular hour. You can also extend the hour beyond what you have. In the following example, it extends out to hour 8.

import pandas as pd

# Every (hour, sensor_id) pair that should exist: hours 1-8 for both sensors
new_ix = pd.MultiIndex.from_product([range(1,9), [101, 102]], names=['hour', 'sensor_id'])
df_new = df.set_index(['hour', 'sensor_id'])
df_new.reindex(new_ix, fill_value=0).reset_index()

Output:

    hour  sensor_id  hourly_count
0      1        101           651
1      1        102            19
2      2        101           423
3      2        102            12
4      3        101           356
5      3        102             0
6      4        101            79
7      4        102            21
8      5        101           129
9      5        102             0
10     6        101           561
11     6        102             0
12     7        101             0
13     7        102             0
14     8        101             0
15     8        102             0
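If you would rather not hard-code the hours and sensor ids, the same reindex idea can be sketched with the index derived from the data. This is only a sketch: it assumes hours are consecutive integers between the smallest and largest hour observed, and the names `hours`, `sensors`, `full_ix` and `filled` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "hour":         [1, 1, 2, 2, 3, 4, 4, 5, 6],
    "sensor_id":    [101, 102, 101, 102, 101, 101, 102, 101, 101],
    "hourly_count": [651, 19, 423, 12, 356, 79, 21, 129, 561],
})

# Assumption: every integer hour between the observed min and max should exist.
hours = range(int(df["hour"].min()), int(df["hour"].max()) + 1)
sensors = sorted(df["sensor_id"].unique().tolist())

# All (hour, sensor_id) combinations, then reindex onto that grid.
full_ix = pd.MultiIndex.from_product([hours, sensors], names=["hour", "sensor_id"])
filled = (
    df.set_index(["hour", "sensor_id"])
      .reindex(full_ix, fill_value=0)
      .reset_index()
)
print(filled)
```

Because `fill_value=0` is supplied at reindex time, no NaN is ever introduced and hourly_count keeps its integer dtype.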

Chris · Accepted Answer · 2019-08-12 04:26:39Z

Use pandas.DataFrame.pivot and then unstack with reset_index:

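# Keyword arguments are used because pandas 2.0+ no longer accepts them positionally
new_df = df.pivot(index='sensor_id', columns='hour', values='hourly_count').fillna(0).unstack().reset_index()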
print(new_df)

Output:

    hour  sensor_id      0
0      1        101  651.0
1      1        102   19.0
2      2        101  423.0
3      2        102   12.0
4      3        101  356.0
5      3        102    0.0
6      4        101   79.0
7      4        102   21.0
8      5        101  129.0
9      5        102    0.0
10     6        101  561.0
11     6        102    0.0
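The output above has a value column named 0 and float counts, because pivot introduces NaN before fillna runs. A possible variation that restores the column name and integer dtype is sketched below; the names `wide` and `tidy` are illustrative, and the cast assumes the counts are whole numbers, as they are in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "hour":         [1, 1, 2, 2, 3, 4, 4, 5, 6],
    "sensor_id":    [101, 102, 101, 102, 101, 101, 102, 101, 101],
    "hourly_count": [651, 19, 423, 12, 356, 79, 21, 129, 561],
})

# One row per sensor, one column per hour; missing pairs become NaN.
wide = df.pivot(index="sensor_id", columns="hour", values="hourly_count")

tidy = (
    wide.fillna(0)
        .astype(int)                        # undo the float upcast caused by NaN
        .unstack()                          # back to one value per (hour, sensor_id)
        .reset_index(name="hourly_count")   # name the value column instead of 0
)
print(tidy)
```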

Andy L. · Accepted Answer · 2019-08-12 04:28:48Z

Assuming rows are missing only for sensor_id 102, so that every hour still appears at least once in the data: one way is to create a new df with all combinations of the observed hours and the sensor ids, left-merge it with the original df to pull in hourly_count, and then fillna.

a = df.hour.unique()
df1 = pd.MultiIndex.from_product([a, [101, 102]]).to_frame(index=False, name=['hour', 'sensor_id'])
df1

Out[157]:
    hour  sensor_id
0      1        101
1      1        102
2      2        101
3      2        102
4      3        101
5      3        102
6      4        101
7      4        102
8      5        101
9      5        102
10     6        101
11     6        102

df1.merge(df, on=['hour','sensor_id'], how='left').fillna(0)

Out[161]:
    hour  sensor_id  hourly_count
0      1        101         651.0
1      1        102          19.0
2      2        101         423.0
3      2        102          12.0
4      3        101         356.0
5      3        102           0.0
6      4        101          79.0
7      4        102          21.0
8      5        101         129.0
9      5        102           0.0
10     6        101         561.0
11     6        102           0.0
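As with any left merge, the unmatched rows come back as NaN, which is why hourly_count is shown as float above. A hedged variation that casts back to int and also records which rows were filled in is sketched here; the `was_missing` column and the names `grid` and `merged` are illustrative additions, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "hour":         [1, 1, 2, 2, 3, 4, 4, 5, 6],
    "sensor_id":    [101, 102, 101, 102, 101, 101, 102, 101, 101],
    "hourly_count": [651, 19, 423, 12, 356, 79, 21, 129, 561],
})

# Full grid of observed hours x observed sensors.
grid = pd.MultiIndex.from_product(
    [df["hour"].unique(), df["sensor_id"].unique()],
    names=["hour", "sensor_id"],
).to_frame(index=False)

# indicator=True adds a "_merge" column saying where each row came from.
merged = grid.merge(df, on=["hour", "sensor_id"], how="left", indicator=True)
merged["was_missing"] = merged["_merge"] == "left_only"
merged["hourly_count"] = merged["hourly_count"].fillna(0).astype(int)
merged = merged.drop(columns="_merge")
print(merged)
```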

Another way: use unstack with fill_value, then stack back

df.set_index(['hour', 'sensor_id']).unstack(fill_value=0).stack().reset_index()

Out[171]:
    hour  sensor_id  hourly_count
0      1        101           651
1      1        102            19
2      2        101           423
3      2        102            12
4      3        101           356
5      3        102             0
6      4        101            79
7      4        102            21
8      5        101           129
9      5        102             0
10     6        101           561
11     6        102             0
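One caveat with the unstack/stack approach: it can only produce hours that appear for at least one sensor. The sketch below drops hour 5 from the question's data (a hypothetical scenario, not in the original post) to show that the hour stays absent; an explicit reindex is needed in that case.

```python
import pandas as pd

# Question data with the hour-5 row removed, so no sensor reports hour 5.
df = pd.DataFrame({
    "hour":         [1, 1, 2, 2, 3, 4, 4, 6],
    "sensor_id":    [101, 102, 101, 102, 101, 101, 102, 101],
    "hourly_count": [651, 19, 423, 12, 356, 79, 21, 561],
})

result = (
    df.set_index(["hour", "sensor_id"])
      .unstack(fill_value=0)   # sensors become columns; gaps filled with 0
      .stack()                 # back to one row per (hour, sensor_id)
      .reset_index()
)
print(result)

# Hour 3 and hour 6 are completed for sensor 102, but hour 5 is still missing.
print(sorted(result["hour"].unique().tolist()))
```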
