
I have a df like this:

time  units   cost
0      4       10
1      2       10
3      4       20
4      1       20
5      3       10
6      1       20
9      2       10

As you can see, df.time is not consecutive. If there is a missing value, I want to add a new row, populating df.time with the consecutive time value, df.units with 2 and df.cost with 20. Expected output:

time  units   cost
0      4       10
1      2       10
2      2       20
3      4       20
4      1       20
5      3       10
6      1       20
7      2       20
8      2       20
9      2       10

How do I do this? I understand how to do this by deconstructing all series into lists, looping through them, and appending values whenever a time value is not one more than the previous one, but this seems inefficient.
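For reference, here is a minimal sketch that rebuilds the sample frame above and lists which time values are missing. The names `full_range` and `missing` are illustrative only and are not part of the original post.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "time":  [0, 1, 3, 4, 5, 6, 9],
    "units": [4, 2, 4, 1, 3, 1, 2],
    "cost":  [10, 10, 20, 20, 10, 20, 10],
})

# Every integer between the smallest and largest time value (illustrative name).
full_range = range(df["time"].min(), df["time"].max() + 1)

# Time values that have no row yet; these are the rows that need to be added.
missing = sorted(set(full_range) - set(df["time"]))
print(missing)  # [2, 7, 8]
```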

3 Answers


You can use the reindex method with a call to fillna to do this:

# Build new index that ranges from time min to time max with a step of 1
new_index = range(df["time"].min(), df["time"].max() + 1)


out = (df.set_index("time")                # Index our dataframe with the original time column
         .reindex(new_index)               # Reindex our dataframe with the new_index, all empty cells appear as nan
         .fillna({"units": 2, "cost": 20}) # Fill in the nans for units and cost with 2 and 20 respectively
         .astype(int))                     # Due to NaNs that were in column from reindexing, we'll manually recast our
                                           #   data type from float to int (not necessary, but produces cleaner output)

print(out)
      units  cost
time             
0         4    10
1         2    10
2         2    20
3         4    20
4         1    20
5         3    10
6         1    20
7         2    20
8         2    20
9         2    10
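The final astype(int) step is easier to follow when the intermediate result is inspected. The sketch below is self-contained and assumes the sample frame from the question; the variable name `reindexed` is introduced here for illustration.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "time":  [0, 1, 3, 4, 5, 6, 9],
    "units": [4, 2, 4, 1, 3, 1, 2],
    "cost":  [10, 10, 20, 20, 10, 20, 10],
})

new_index = range(df["time"].min(), df["time"].max() + 1)

# Intermediate step: rows for the missing times exist but hold NaN.
reindexed = df.set_index("time").reindex(new_index)

# NaN cannot be stored in a plain integer column, so both columns are now float64.
print(reindexed.dtypes)

# Fill each column with its own default, then cast back to integers.
out = reindexed.fillna({"units": 2, "cost": 20}).astype(int)

# Show only the rows that were added.
print(out.loc[[2, 7, 8]])
```

Passing a dict to fillna lets each column receive a different default in a single call, which is what keeps this version short.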

1 Comment

fillna takes a dict; I forgot about that. +1

You can use df.reindex, then pd.Series.fillna.

import pandas as pd

idx = pd.RangeIndex(df['time'].min(), df['time'].max()+1)
# If `df.time` is always sorted then,
# idx = pd.RangeIndex(df['time'].iat[0], df['time'].iat[-1]+1)

df = df.set_index('time')
df = df.reindex(idx)
df['units'] = df['units'].fillna(2).astype(int)
df['cost'] = df['cost'].fillna(20).astype(int)

# If you prefer not to hard-code the names of the columns, replace
# the last two lines with:
#   defaults = [2,20]
#   for (name, default) in zip(df.columns, defaults):
#       df[name] = df[name].fillna(default).astype(type(default))

      units  cost
time             
0         4    10
1         2    10
2         2    20
3         4    20
4         1    20
5         3    10
6         1    20
7         2    20
8         2    20
9         2    10
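The commented-out loop above pairs defaults with columns by position. A possible variant, sketched here under the assumption that a dict keyed by column name is acceptable, avoids depending on column order. The `defaults` dict and the name `out` are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "time":  [0, 1, 3, 4, 5, 6, 9],
    "units": [4, 2, 4, 1, 3, 1, 2],
    "cost":  [10, 10, 20, 20, 10, 20, 10],
})

idx = pd.RangeIndex(df["time"].min(), df["time"].max() + 1)
out = df.set_index("time").reindex(idx)

# Illustrative mapping of column name -> fill value for newly added rows.
defaults = {"units": 2, "cost": 20}

for name, default in defaults.items():
    # Fill the gaps, then cast back to the Python type of the default (int here).
    out[name] = out[name].fillna(default).astype(type(default))

print(out)
```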

1 Comment

Going to edit this with a suggestion as a comment - feel free to edit further to either incorporate this into the actual code or to undo my edit, as you see fit...

You can construct a new DataFrame with a complete "time" column and then call .fillna() on it, passing the original dataframe as the source of fill values (df is your original dataframe):

import numpy as np
import pandas as pd

r = range(df['time'].min(), df['time'].max()+1)
df_out = pd.DataFrame({'time': r, 'units': [np.nan]*len(r), 'cost': [np.nan]*len(r)}).set_index('time')

df_out = df_out.fillna(df.set_index('time'))
df_out['units'] = df_out['units'].fillna(2).astype(int)
df_out['cost'] = df_out['cost'].fillna(20).astype(int)

print(df_out)

Prints:

      units  cost
time             
0         4    10
1         2    10
2         2    20
3         4    20
4         1    20
5         3    10
6         1    20
7         2    20
8         2    20
9         2    10
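As a self-contained sketch of the same idea, the all-NaN frame can also be built by broadcasting a scalar np.nan over an index and a column list, which avoids repeating [np.nan]*len(r) per column. This is a variation on the code above, not the answer's exact code, and the name `skeleton` is illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "time":  [0, 1, 3, 4, 5, 6, 9],
    "units": [4, 2, 4, 1, 3, 1, 2],
    "cost":  [10, 10, 20, 20, 10, 20, 10],
})

# Complete, gap-free index named "time".
idx = pd.RangeIndex(df["time"].min(), df["time"].max() + 1, name="time")

# All-NaN frame covering every time value (illustrative name).
skeleton = pd.DataFrame(np.nan, index=idx, columns=["units", "cost"])

# fillna with a DataFrame aligns on index and columns, so the known rows
# are copied from the original and the gaps stay NaN.
df_out = skeleton.fillna(df.set_index("time"))

# Fill the remaining gaps with the defaults and cast back to integers.
df_out = df_out.fillna({"units": 2, "cost": 20}).astype(int)

print(df_out)
```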
