I have a data file containing energy use per second per train. The application that generated the file dropped every row whose energy use was 0, and I need to recreate those rows.
Specifically, the requirement is: for every train number, ensure there is at least one record per second, using 0 for energy if the record has to be added.
My initial DataFrame looks like this (seconds is a timestamp in seconds-since-midnight):
train seconds energy
0 1024 13980 105.0000
1 1024 14745 114.0000
2 1024 14746 127.0100
3 1024 14747 137.5667
... ... ... ...
4284449 7564 95495 -301.6824
4284450 7564 95496 -181.0630
4284451 7564 95497 -60.3713
Notice that there is a time gap between row 0 and row 1 of 14745-13980 = 765 seconds. As far as we know, the only gaps in the per-second records for every train are between the first and second records, and you can tell how many are missing from the difference in the seconds values. But since I need to have one row for every missing second for every train, it is best not to assume the only missing values are between the first and second records.
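A quick way to check where the gaps actually are, sketched on a small made-up frame (the values and the name `missing_before` are illustrative assumptions, not taken from the real file), is to take the per-train difference between consecutive seconds values:

```python
import pandas as pd

# Toy data shaped like the real file; the values are made up for illustration.
df_datafile = pd.DataFrame({
    "train": ["1024", "1024", "1024", "7564", "7564", "7564"],
    "seconds": [13980, 13983, 13984, 95495, 95496, 95499],
    "energy": [105.0, 114.0, 127.01, -301.6824, -181.063, -60.3713],
})

# Sort so that consecutive rows of a train are consecutive in time.
ordered = df_datafile.sort_values(["train", "seconds"])

# The difference to the previous record of the same train, minus one,
# is the number of seconds missing just before each record.
# The first record of each train has no predecessor, so it gets 0.
missing_before = (
    ordered.groupby("train")["seconds"].diff().sub(1).fillna(0).astype(int)
)

# Keep only the records that follow a gap.
gaps = ordered.assign(missing_before=missing_before).query("missing_before > 0")
print(gaps)
```

Any row in `gaps` that is not the second record of its train would show that the "only between the first and second records" assumption does not hold.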
My plan is:
- Group by train to get the min and max seconds per train
- Generate a new DataFrame from the product of train and arange(min, max) for each train. This will give me a DataFrame with every train and every second-per-train, and no energy column.
- Left-merge the original DataFrame onto the new DataFrame, which will leave NA values for energy in any row that wasn't there before
- Replace all NA energy values with 0, and I'm done.
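As a minimal sketch of the whole plan on made-up data (the names `bounds`, `full` and `result` are placeholders, and this is one possible approach rather than a tested solution for the real file), the (train, second) pairs can be built directly in long form with `numpy.repeat` and `numpy.concatenate`. Note the `+ 1`: `arange` excludes its stop value, so without it the last second of each train would be lost in the left merge.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_datafile; the values are made up for illustration.
df_datafile = pd.DataFrame({
    "train": ["A", "A", "B", "B"],
    "seconds": [10, 13, 20, 21],
    "energy": [5.0, 6.0, -1.0, -2.0],
})

# Step 1: first and last second per train.
bounds = df_datafile.groupby("train")["seconds"].agg(["min", "max"]).rename(
    columns={"min": "minsec", "max": "maxsec"})

# Step 2: one (train, second) row for every second of every train.
# maxsec + 1 because arange excludes its stop value.
counts = (bounds["maxsec"] - bounds["minsec"] + 1).to_numpy()
full = pd.DataFrame({
    "train": np.repeat(bounds.index.to_numpy(), counts),
    "seconds": np.concatenate([
        np.arange(lo, hi + 1)
        for lo, hi in zip(bounds["minsec"], bounds["maxsec"])
    ]),
})

# Step 3: left merge, so seconds absent from the original get NaN energy.
result = full.merge(df_datafile, on=["train", "seconds"], how="left")

# Step 4: the re-created rows get 0 for energy.
result["energy"] = result["energy"].fillna(0)
print(result)
```

If the original file holds more than one record for the same train and second, the merge keeps all of them, which still satisfies "at least one record per second".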
Step 1:
# Get the minimum and maximum seconds value per train
df = df_datafile.groupby(['train'])['seconds'].agg(['min', 'max']).rename(
    columns={'min': 'minsec', 'max': 'maxsec'})
which results in:
minsec maxsec
train
1001 21923 25302
1003 22825 26197
1005 23736 27207
1007 24620 28009
1011 25548 28889
... ... ...
VIAE858 52785 53380
VIAE87 53442 54262
VIAE88 83204 85785
VIAE97 21942 27054
VIAE98 71123 73186
Step 2:
from itertools import product
from numpy import arange
from pandas import DataFrame

# Create one (train, second) record for every second of every train
df = DataFrame([product(*[[train], arange(minsec, maxsec)])
                for train, minsec, maxsec in list(zip(df.index, df.minsec, df.maxsec))])
which results in:
0 1 2 ... 35403 35404 35405
0 (1001, 21923) (1001, 21924) (1001, 21925) ... None None None
1 (1003, 22825) (1003, 22826) (1003, 22827) ... None None None
2 (1005, 23736) (1005, 23737) (1005, 23738) ... None None None
3 (1007, 24620) (1007, 24621) (1007, 24622) ... None None None
4 (1011, 25548) (1011, 25549) (1011, 25550) ... None None None
... ... ... ... ... ... ... ...
2561 (VIAE858, 52785) (VIAE858, 52786) (VIAE858, 52787) ... None None None
2562 (VIAE87, 53442) (VIAE87, 53443) (VIAE87, 53444) ... None None None
2563 (VIAE88, 83204) (VIAE88, 83205) (VIAE88, 83206) ... None None None
2564 (VIAE97, 21942) (VIAE97, 21943) (VIAE97, 21944) ... None None None
2565 (VIAE98, 71123) (VIAE98, 71124) (VIAE98, 71125) ... None None None
[2566 rows x 35406 columns]
All the None values arise because the longest train spans 35406 seconds, and every other row in the DataFrame is padded to match that row's number of columns. These None values need to be eliminated.
But now I'm stuck. What I want to get to is:
train seconds
0 1001 21923
1 1001 21924
2 1001 21925
... ... ...
??? VIAE98 71123
??? VIAE98 71124
??? VIAE98 71125
Effectively, each individual row has been transposed (with the individual lists expanded and null elements eliminated) and all of the transposed rows have been concatenated into one long DataFrame of 2 columns.
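One possible way to reach that shape, given here only as a sketch on a made-up Step 1 result, is to skip the wide DataFrame and flatten the per-train products into a single sequence of (train, second) pairs with `itertools.chain.from_iterable`. The name `long_df` is a placeholder, and the stop value is written as `maxsec + 1` on the assumption that the last second of each train should be included.

```python
from itertools import chain, product

from numpy import arange
from pandas import DataFrame

# Toy stand-in for the Step 1 result (index = train); the values are made up.
df = DataFrame({"minsec": [10, 20], "maxsec": [12, 21]}, index=["A", "B"])
df.index.name = "train"

# Chain the per-train products into one flat list of (train, second) tuples.
# No padding with None occurs because no wide frame is ever built.
pairs = list(chain.from_iterable(
    product([train], arange(minsec, maxsec + 1))
    for train, minsec, maxsec in zip(df.index, df.minsec, df.maxsec)
))

# Each tuple becomes one row of a two-column DataFrame.
long_df = DataFrame(pairs, columns=["train", "seconds"])
print(long_df)
```

With a single train the inner `product` is just a way of pairing the train with each second, so a plain list comprehension over `arange` would do the same job.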
Can you help me with this last step, or suggest another way to solve my original problem (the requirement stated at the beginning)?
Thank you very much for any help. I really appreciate all those who answer StackOverflow questions.
Mark Batten-Carew