
I have a DataFrame with dtype=object as:

          YY    MM  DD  hh  var1    var2
...
10512   2013    01  01  06  1.64    4.64
10513   2013    01  01  07  1.57    4.63
10514   2013    01  01  08  1.56    4.71
10515   2013    01  01  09  1.45    4.69
10516   2013    01  01  10  1.53    4.67
10517   2013    01  01  11  1.31    4.63
10518   2013    01  01  12  1.41    4.70
10519   2013    01  01  13  1.49    4.80
10520   2013    01  01  20  1.15    4.91
10521   2013    01  01  21  1.14    4.74
10522   2013    01  01  22  1.10    4.95

As seen, there are missing rows for some hours (hh); for instance, between rows 10519 and 10520, hh jumps from 13 to 20. I tried to fill the gap by setting hh as the index, as discussed here: Missing data, insert rows in Pandas and fill with NAN

df=df.set_index('hh')
new_index = pd.Index(np.arange(0,24), name="hh")
df=df.reindex(new_index).reset_index() 

and get something like:

          YY    MM  DD  hh  var1    var2

10519   2013    01  01  13  1.49    4.80
10520   2013    01  01  14  NaN     NaN
10521   2013    01  01  15  NaN     NaN
10522   2013    01  01  16  NaN     NaN
...
10523   2013    01  01  20  1.15    4.91
10524   2013    01  01  21  1.14    4.74
10525   2013    01  01  22  1.10    4.95

But I encounter the error "cannot reindex from a duplicate axis" at the line df=df.reindex(new_index). There are duplicate values for each hh=0,1,...,23, because the same hh value is repeated across different months (MM) and years (YY); that is probably the reason. How can I solve this?

In general, how can one fill the missing rows of a pandas DataFrame when the index contains duplicate values? I appreciate any comments.
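
(For illustration only, not from the original post: a minimal sketch that reproduces the error with a made-up toy frame in which the same hh value appears twice.)

import numpy as np
import pandas as pd

# Toy frame in which hours 6 and 7 each appear twice (e.g. on two different days).
toy = pd.DataFrame({'hh':   [6, 7, 6, 7],
                    'var1': [1.0, 2.0, 3.0, 4.0]}).set_index('hh')

try:
    toy.reindex(pd.Index(np.arange(24), name='hh'))
except ValueError as err:
    # Older pandas wording: "cannot reindex from a duplicate axis";
    # newer versions phrase the message slightly differently.
    print(err)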

2 Comments
  • Where exactly is the missing data? I see no NaN for hh between rows 10519 and 10520. Commented Sep 18, 2018 at 3:30
  • @KhalilAlHooti The data has an hourly resolution but it jumps from 13 to 20. I want to insert the missing rows (14, 15, 16, ...) with NaN values for var1 and var2 so that I can fill the NaNs afterwards with simple techniques such as interpolation. Commented Sep 18, 2018 at 3:39
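
(A minimal sketch of the interpolation step mentioned in the comment above, not part of the original thread; it assumes the missing rows already exist as NaN, and casts the value columns to numbers first since the frame is dtype=object.)

# Linear interpolation (the pandas default) over the inserted NaN rows.
df[['var1', 'var2']] = df[['var1', 'var2']].apply(pd.to_numeric)
df[['var1', 'var2']] = df[['var1', 'var2']].interpolate()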

2 Answers


First, create a new datetime column that combines the date and hour. One way to do this is as follows:

df = df.rename(columns={'YY': 'year', 'MM': 'month', 'DD': 'day', 'hh': 'hour'})
df['time'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])

To use to_datetime in this way, the column names need to be year, month, day and hour, which is why rename is used.

To get the expected result, set this new column as the index and use resample:

df.set_index('time').resample('H').mean()
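
Putting it together, a sketch of the whole workflow could look like this (the pd.to_numeric cast is an assumption: the question says the frame is dtype=object, and resample(...).mean() needs numeric columns):

import pandas as pd

# Rename so that to_datetime() recognises the component columns.
df = df.rename(columns={'YY': 'year', 'MM': 'month', 'DD': 'day', 'hh': 'hour'})
df['time'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])

# Cast the value columns to numbers before averaging.
df[['var1', 'var2']] = df[['var1', 'var2']].apply(pd.to_numeric)

# One row per hour; hours absent from the data become NaN.
# (If the columns had to stay non-numeric, something like .agg('first')
# could be used instead of .mean(), as discussed in the comments below.)
out = df.set_index('time')[['var1', 'var2']].resample('H').mean()

If the original YY/MM/DD/hh columns are needed afterwards, they can be rebuilt from out.index (for example out.index.year).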

2 Comments

Thanks for your time. It solves the problem. The only thing is that the dtype needed to be numeric, not an object, which one can easily change.
@PierreHoshyar: Happy to help. Yes, to use the resample method the columns themselves need to be numeric, but that can be solved by using agg instead of mean; see stackoverflow.com/questions/47613521/…

The code below does exactly what you need.

import pandas as pd
import numpy as np
from io import StringIO

# Lists that will hold each column while parsing the raw text.
YY, MM, DD, hh, var1, var2 = [], [], [], [], [], []


a = '''10512   2013    01  01  06  1.64    4.64
10513   2013    01  01  07  1.57    4.63
10514   2013    01  01  08  1.56    4.71
10515   2013    01  01  09  1.45    4.69
10516   2013    01  01  10  1.53    4.67
10517   2013    01  01  11  1.31    4.63
10518   2013    01  01  12  1.41    4.70
10519   2013    01  01  13  1.49    4.80
10520   2013    01  01  20  1.15    4.91
10521   2013    01  01  21  1.14    4.74
10522   2013    01  01  22  1.10    4.95
10523   2013    01  01  27  1.30    4.55
10524   2013    01  01  28  1.2     4.62
'''

text = StringIO(a)

for line in text.readlines():
    a = line.strip().split(" ")
    a = list(filter(None, a))   # drop empty strings left by repeated spaces
    YY.append(a[1])             # a[0] is the original row label, skip it
    MM.append(a[2])
    DD.append(a[3])
    hh.append(a[4])
    var1.append(a[5])
    var2.append(a[6])

df = pd.DataFrame({'YY':YY, 'MM':MM, 'DD':DD,
                   'hh':hh, 'var1':var1, 'var2':var2})

df['hh'] = df.hh.astype(int)   # hh must be numeric to compute the gaps below


a = np.diff(df.hh)     # hour-to-hour differences
b = np.where(a != 1)   # positions after which at least one hour is missing


df2 = df.copy(deep=True)   # untouched copy used to read the original values

for i in range(len(df)):

    if i in b[0]:
        # build a block with one row per missing hour in this gap
        line = pd.DataFrame(columns=['YY', 'MM', 'DD',
                                     'hh', 'var1', 'var2'])
        for k in range(a[i]-1):
            # same date as the row before the gap, hour incremented, values NaN
            line.loc[k] = [df2.iloc[i, 0], df2.iloc[i, 1],
                           df2.iloc[i, 2], df2.iloc[i, 3]+k+1,
                           np.nan, np.nan]

        # splice the new rows in right after row i
        df = pd.concat([df.loc[:i],
                        line, df.loc[i+1:]])


df.reset_index(inplace=True, drop=True)

print(df)

      YY  MM  DD  hh  var1  var2
0   2013  01  01   6  1.64  4.64
1   2013  01  01   7  1.57  4.63
2   2013  01  01   8  1.56  4.71
3   2013  01  01   9  1.45  4.69
4   2013  01  01  10  1.53  4.67
5   2013  01  01  11  1.31  4.63
6   2013  01  01  12  1.41  4.70
7   2013  01  01  13  1.49  4.80
8   2013  01  01  14   NaN   NaN
9   2013  01  01  15   NaN   NaN
10  2013  01  01  16   NaN   NaN
11  2013  01  01  17   NaN   NaN
12  2013  01  01  18   NaN   NaN
13  2013  01  01  19   NaN   NaN
14  2013  01  01  20  1.15  4.91
15  2013  01  01  21  1.14  4.74
16  2013  01  01  22  1.10  4.95
17  2013  01  01  23   NaN   NaN
18  2013  01  01  24   NaN   NaN
19  2013  01  01  25   NaN   NaN
20  2013  01  01  26   NaN   NaN
21  2013  01  01  27  1.30  4.55
22  2013  01  01  28   1.2  4.62

3 Comments

Thank you for your time. This also works. Just out of curiosity, those for loops may be time-consuming when the DataFrame is huge, correct?
Probably, yes. The inner for loop is the most time-consuming part, especially if many samples are missing. There is another drawback: if, for example, hourly data are missing across consecutive days, the code will not work unless grouping per day is applied before the loops, which is simple to do (see the sketch below).
Yes, you mean a gap that spans more than one day. I thought of that, but I believe the data are never missing for more than a few hours. Thanks for your time.
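
(A minimal sketch of the per-day grouping mentioned in the comment above, not part of the original answer; it assumes hh is an integer column and that each (YY, MM, DD) combination identifies one day.)

import numpy as np
import pandas as pd

# Reindex hours 0..23 within each day, so gaps are filled with NaN rows even
# when they touch a day boundary.
full_hours = pd.Index(np.arange(24), name='hh')
filled = (df.set_index('hh')
            .groupby(['YY', 'MM', 'DD'])[['var1', 'var2']]
            .apply(lambda g: g.reindex(full_hours))
            .reset_index())

This avoids the row-by-row insertion loop and also covers days whose last recorded hour is before 23.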
