How to create/populate missing rows in a DataFrame?

Question

I have a data file containing energy use per second per train. The application that generated the file dropped every row where energy use was 0, and I need to recreate those rows.

Specifically the need is: For every train number, ensure there is at least one record per second, using 0 for energy if the record has to be added.

My initial DataFrame looks like this (seconds is a timestamp in seconds-since-midnight):

           train  seconds    energy   
0           1024    13980  105.0000   
1           1024    14745  114.0000   
2           1024    14746  127.0100 
3           1024    14747  137.5667 
...          ...      ...       ...      
4284449     7564    95495 -301.6824   
4284450     7564    95496 -181.0630   
4284451     7564    95497  -60.3713

Notice that there is a time gap between row 0 and row 1 of 14745-13980 = 765 seconds. As far as we know, the only gaps in the per-second records for every train are between the first and second records, and you can tell how many are missing from the difference in the seconds values. But since I need to have one row for every missing second for every train, it is best not to assume the only missing values are between the first and second records.

My plan is:

1. Group by train to get the min and max seconds per train.
2. Generate a new DataFrame from the product of train and arange(min, max), for each train. This will give me a DataFrame with every train and every second-per-train, and no energy column.
3. Left-merge the new DataFrame with the original DataFrame, which will leave NA values for energy in any row that wasn't there before.
4. Replace all NA energy values with 0, and I'm done.

Step 1:

# Get the minimum and maximum seconds value per train   
df = df_datafile.groupby(['train'])['seconds'].agg(['min', 'max']).rename(
                         columns={'min': 'minsec', 'max': 'maxsec'})

which results in:

              minsec  maxsec
    train                
    1001       21923   25302
    1003       22825   26197
    1005       23736   27207
    1007       24620   28009
    1011       25548   28889
    ...          ...     ...
    VIAE858    52785   53380
    VIAE87     53442   54262
    VIAE88     83204   85785
    VIAE97     21942   27054
    VIAE98     71123   73186

Step 2:

# Create one (train, second) record for every second of every train 
df = DataFrame([product(*[[train], arange(minsec, maxsec)])
               for train, minsec, maxsec in list(zip(df.index, df.minsec, df.maxsec))])

which results in:

                 0                 1                 2      ... 35403 35404 35405
0        (1001, 21923)     (1001, 21924)     (1001, 21925)  ...  None  None  None
1        (1003, 22825)     (1003, 22826)     (1003, 22827)  ...  None  None  None
2        (1005, 23736)     (1005, 23737)     (1005, 23738)  ...  None  None  None
3        (1007, 24620)     (1007, 24621)     (1007, 24622)  ...  None  None  None
4        (1011, 25548)     (1011, 25549)     (1011, 25550)  ...  None  None  None
...                ...               ...               ...  ...   ...   ...   ...
2561  (VIAE858, 52785)  (VIAE858, 52786)  (VIAE858, 52787)  ...  None  None  None
2562   (VIAE87, 53442)   (VIAE87, 53443)   (VIAE87, 53444)  ...  None  None  None
2563   (VIAE88, 83204)   (VIAE88, 83205)   (VIAE88, 83206)  ...  None  None  None
2564   (VIAE97, 21942)   (VIAE97, 21943)   (VIAE97, 21944)  ...  None  None  None
2565   (VIAE98, 71123)   (VIAE98, 71124)   (VIAE98, 71125)  ...  None  None  None

[2566 rows x 35406 columns]

All the None values appear because the longest train was 35406 seconds long, and every other row in the DataFrame has to be padded to match that row's number of columns. These None values need to be eliminated.

But now I'm stuck. What I want to get to is:

         train seconds   
0        1001  21923 
1        1001  21924 
2        1001  21925
...      ...     ...
???    VIAE98  71123
???    VIAE98  71124
???    VIAE98  71125

Effectively, each individual row has been transposed (with the individual lists expanded and null elements eliminated) and all of the transposed rows have been concatenated into one long DataFrame of 2 columns.

Can you help me with this last step, and/or suggest some other way to solve my original problem (stated at the beginning)?

Thank-you very much for any help. I really appreciate all those who answer StackOverflow questions.

Mark Batten-Carew

Cainã Max Couto da Silva · Accepted Answer · 2020-12-10 02:12:36Z

You can use reindex taking the min and max of seconds per train group:

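def populate_df(grp):
    # Reindex one train's rows onto every second from its first to its last record
    grp = (grp.set_index('seconds')
           .reindex(range(grp.seconds.min(), grp.seconds.max()+1))
           .drop(columns='train', errors='ignore')  # harmless if 'train' is absent from the group (depends on pandas version)
           .fillna(0)  # seconds that had no row get energy 0
          )
    return grp

# df is the original data with columns train, seconds, energy
df.groupby('train').apply(populate_df).reset_index()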
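To see the effect on a small case, here is a minimal, self-contained sketch of the same reindex idea. The toy values and the names `toy` and `filled` are made up for illustration. The `errors="ignore"` and `rename_axis` calls are defensive additions that are not part of the answer above; they assume that some pandas versions pass the group to `apply` without the `train` column.

```python
import pandas as pd

def populate_df(grp):
    # Every second from this train's first record to its last, inclusive.
    full_range = range(grp["seconds"].min(), grp["seconds"].max() + 1)
    return (grp.set_index("seconds")
               .reindex(full_range)                      # missing seconds become NaN rows
               .rename_axis("seconds")                   # keep the index name explicit
               .drop(columns="train", errors="ignore")   # no-op if 'train' is not present
               .fillna(0))                               # missing energy -> 0

# Toy data (hypothetical): train 1024 lacks seconds 11 and 12, train 7564 lacks second 21.
toy = pd.DataFrame({
    "train":   [1024, 1024, 1024, 7564, 7564],
    "seconds": [10, 13, 14, 20, 22],
    "energy":  [105.0, 114.0, 127.01, -301.7, -60.4],
})

filled = toy.groupby("train").apply(populate_df).reset_index()
print(filled)
```

Each train is filled only between its own first and last second, so trains that run at different times of day do not get padded out to a common range.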

1 Comment

MarkBC Over a year ago

This worked perfectly, thank-you! Just in case any future reader runs into my next issue... Unfortunately, the first run failed because it turned out my data (in addition to missing some seconds) has some rows with duplicate seconds values, and the reindex command fails if there is a duplicated key. So the next thing I had to do was resolve duplicates, and then this code ran fine.
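The comment does not say how the duplicates were resolved, so the following is only a sketch of one possible approach, not the commenter's actual fix. It assumes a duplicate means several rows sharing the same (train, seconds) pair, and that averaging their energy is acceptable; keeping the first row or summing may suit other data better. The toy values and the names `raw`, `dupes` and `deduped` are hypothetical.

```python
import pandas as pd

# Hypothetical data: train 1024 has two rows for second 13.
raw = pd.DataFrame({
    "train":   [1024, 1024, 1024, 1024],
    "seconds": [10, 13, 13, 14],
    "energy":  [105.0, 114.0, 116.0, 127.01],
})

# Inspect every row whose (train, seconds) key occurs more than once.
dupes = raw[raw.duplicated(subset=["train", "seconds"], keep=False)]
print(dupes)

# Assumed rule: collapse duplicated seconds to the mean of their energy values.
deduped = raw.groupby(["train", "seconds"], as_index=False)["energy"].mean()
print(deduped)
```

After this step each (train, seconds) pair is unique, so setting `seconds` as the index within a train and reindexing no longer hits a duplicated key.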
