I have a file with the following format:
SET, 0, 0, 0, 6938987, 0, 4
SET, 1, 1, 6938997, 128, 0, 0
SET, 2, 4, 6938998, 145, 0, 2
SET, 0, 9, 6938998, 147, 0, 0
SET, 1, 11, 6938998, 149, 0, 0
....
SET, 1, 30, 6946103, 6, 0, 0
SET, 2, 30, 6946104, 6, 0, 2
GET, 0, 30, 6946104, 8, 0, 0
SET, 1, 30, 6946105, 8, 0, 0
GET, 2, 30, 6946106, 7, 0, 0
The timestamp column (inQueue in the code below) holds milliseconds that I measure from a system (converted from Java's System.nanoTime()), so the values are elapsed time and do not correspond to any date/time format. I want to aggregate over intervals of 5 s, for example from the first value 6938987 to 6943987, and for each interval get the counts of SET/GET, the averages, the standard deviations and so on.
I've tried using data.resample in various ways but keep getting the following error:
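import pandas as pd

# sep=", " is longer than one character, so pandas uses the Python parsing engine; naming it avoids a ParserWarning
data = pd.read_csv('data2.log', sep=", ", header=None, engine="python")
data.columns = ["command", "server", "lenQueue", "inQueue", "diffQueue", "diffParse", "diffProcess"]
r = data.resample("5ms", on='inQueue')  # raises the TypeError below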
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
Is there any way to resample on fixed-width steps of a numeric value instead of on a time series?
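For illustration, here is a minimal sketch of one way to bin on the raw integer values without any datetime conversion: integer-divide the offset from the first timestamp by the bin width and group on the result. It uses the sample rows shown above, leaves out the first row (whose inQueue value is 0) and does not fill in the elided rows. The names `BIN_MS`, `bin`, `counts` and `stats` are made up for the example.

```python
import io
import pandas as pd

# Sample rows copied from the excerpt above; the first row and the "...." rows are left out.
raw = """SET, 1, 1, 6938997, 128, 0, 0
SET, 2, 4, 6938998, 145, 0, 2
SET, 0, 9, 6938998, 147, 0, 0
SET, 1, 11, 6938998, 149, 0, 0
SET, 1, 30, 6946103, 6, 0, 0
SET, 2, 30, 6946104, 6, 0, 2
GET, 0, 30, 6946104, 8, 0, 0
SET, 1, 30, 6946105, 8, 0, 0
GET, 2, 30, 6946106, 7, 0, 0
"""
cols = ["command", "server", "lenQueue", "inQueue",
        "diffQueue", "diffParse", "diffProcess"]
data = pd.read_csv(io.StringIO(raw), header=None, names=cols,
                   skipinitialspace=True)

BIN_MS = 5000  # hypothetical constant: 5 s expressed in ms

# Bin number = whole 5 s steps elapsed since the first timestamp.
start = data["inQueue"].min()
data["bin"] = (data["inQueue"] - start) // BIN_MS

# SET/GET counts per bin, one column per command.
counts = data.groupby(["bin", "command"]).size().unstack(fill_value=0)

# Mean, standard deviation and row count of lenQueue per bin.
stats = data.groupby("bin")["lenQueue"].agg(["mean", "std", "count"])

print(counts)
print(stats)
```

With this approach, bins that contain no rows simply do not appear in the result, which may or may not be what is wanted.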
Edit - solution suggested by JohnE:
I converted the ms values to a timedelta, then resampled to 5ms:
data['td'] = pd.to_timedelta(data['inQueue'], 'ms')
data['sum'] = data.set_index(data['td'])['lenQueue'].resample('5ms').sum()
[Other columns omitted]
                td  sum
0         00:00:00  NaN
1  01:55:38.997000  NaN
2  01:55:38.998000  NaN
3  01:55:38.998000  NaN
4  01:55:38.998000  NaN
5  01:55:38.998000  NaN
6  01:55:38.999000  NaN
Could it be because the other columns also need some aggregation applied to them? If so, how can I apply several aggregations at once?
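One possible reading of the NaN column: the resampled Series is indexed by timedelta bins, while data still has its default integer index, so assigning the result back aligns on index labels and finds no matches. The sketch below assumes that is the cause and keeps the resampled result as its own frame, passing a dict to agg so that several columns get their own aggregations in one call. It uses a few of the sample rows from above with only the columns it needs, and a 5 s window as described at the top. `by_time` and `summary` are hypothetical names.

```python
import pandas as pd

# A few of the sample rows from above (first row and elided rows left out).
data = pd.DataFrame({
    "command": ["SET", "SET", "SET", "SET", "SET", "SET", "GET", "SET", "GET"],
    "lenQueue": [1, 4, 9, 11, 30, 30, 30, 30, 30],
    "inQueue": [6938997, 6938998, 6938998, 6938998,
                6946103, 6946104, 6946104, 6946105, 6946106],
    "diffQueue": [128, 145, 147, 149, 6, 6, 8, 8, 7],
})

# Interpret the ms counter as elapsed time and use it as the index.
data["td"] = pd.to_timedelta(data["inQueue"], unit="ms")
by_time = data.set_index("td")

# One resample, several aggregations: each key is a column,
# each value a list of aggregation names applied to that column.
summary = by_time.resample("5s").agg({
    "command": ["count"],
    "lenQueue": ["sum"],
    "diffQueue": ["mean", "std"],
})

# The result has one row per 5 s bin, not one row per original record,
# so it is kept separate rather than assigned back into `data`.
print(summary)
```

The columns of `summary` form a two-level index such as ("diffQueue", "mean"). The exact bin edges are chosen by pandas, so it is worth checking them against the windows you expect.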