
I have a file with the following format:

SET, 0, 0, 0, 6938987, 0, 4
SET, 1, 1, 6938997, 128, 0, 0
SET, 2, 4, 6938998, 145, 0, 2
SET, 0, 9, 6938998, 147, 0, 0
SET, 1, 11, 6938998, 149, 0, 0
....
SET, 1, 30, 6946103, 6, 0, 0
SET, 2, 30, 6946104, 6, 0, 2
GET, 0, 30, 6946104, 8, 0, 0
SET, 1, 30, 6946105, 8, 0, 0
GET, 2, 30, 6946106, 7, 0, 0

The 5th column holds milliseconds that I measure from a system (converted from Java's System.nanoTime()), so the values don't correspond to any date/time format. I want to aggregate over intervals of 5 s: for example, from the first value 6938987 to 6943987, get the value counts of SET/GET, the averages, the standard deviations, and so on.

I've tried using data.resample in various ways but keep getting the following error:

import pandas as pd

data = pd.read_csv('data2.log', sep=", ", header=None, engine="python")
data.columns = ["command", "server", "lenQueue", "inQueue", "diffQueue", "diffParse", "diffProcess"]
r = data.resample("5ms", on='inQueue')

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'

Is there any way to resample on a plain numeric column instead of a time series?

Edit - solution suggested by JohnE:

Transformed the ms into a timedelta, then resampled at 5 ms:

data['td'] = pd.to_timedelta(data['inQueue'], 'ms')
data['sum'] = data.set_index(data['td'])['lenQueue'].resample('5ms').sum()

[Other columns omitted]
                   td  sum  
0            00:00:00  NaN  
1     01:55:38.997000  NaN  
2     01:55:38.998000  NaN  
3     01:55:38.998000  NaN  
4     01:55:38.998000  NaN  
5     01:55:38.998000  NaN  
6     01:55:38.999000  NaN  

Could it be because the other columns must also be aggregated somehow? If so, how can I apply multiple aggregations?

1 Answer


The error message is telling you that you need to convert to a datetime-like format, so you need to do that!

A fairly easy way is to convert to a timedelta rather than timestamp, which you can do as follows. First let's use a simpler version of your data:

In [143]: df
Out[143]: 
   val       ms       
0   11  6938987
1   22  6938997
2   33  6938998

Then make a new column "td" that represents the timedelta in milliseconds, "ms". (If you wanted microseconds, use "us" instead):

In [144]: df['td'] = pd.to_timedelta(df['ms'], 'ms')

In [145]: df
Out[145]: 
   val       ms              td
0   11  6938987 01:55:38.987000
1   22  6938997 01:55:38.997000
2   33  6938998 01:55:38.998000

Then you can easily use resample. Note that you need to follow resample with some operation (e.g. sum, max, mean, etc.). Here I'll go with sum:

In [146]: df.set_index(df['td'])['val'].resample('5ms').sum()
Out[146]: 
td
01:55:38.987000    11.0
01:55:38.992000     NaN
01:55:38.997000    55.0
Freq: 5L, Name: val, dtype: float64
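To cover the rest of the original question (counts of SET/GET plus several statistics per 5-second bucket), the same timedelta trick extends to multiple aggregations via .agg() and to per-command counts via groupby before resampling. A minimal sketch with a small made-up frame (column names assumed from the question):

```python
import pandas as pd

# Hypothetical frame mirroring a few of the question's columns
data = pd.DataFrame({
    "command":  ["SET", "SET", "GET", "SET", "GET"],
    "lenQueue": [0, 1, 4, 9, 11],
    "inQueue":  [6938987, 6938997, 6943998, 6943999, 6944001],  # ms
})

# Millisecond counter -> timedelta, so resample accepts it
data["td"] = pd.to_timedelta(data["inQueue"], "ms")

# 5-second buckets, several aggregations at once
stats = (data.set_index("td")["lenQueue"]
             .resample("5s")
             .agg(["count", "mean", "std"]))

# Per-command (SET vs GET) counts within each 5-second bucket
counts = (data.set_index("td")
              .groupby("command")["lenQueue"]
              .resample("5s")
              .count())
```

Here stats has one row per 5-second bucket with count/mean/std columns, and counts is indexed by (command, bucket), which gives the SET/GET value counts the question asked for.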

4 Comments

That makes sense, thanks! I've tried to make it work with my dataset, but am only getting NaNs as the aggregation result. I've updated my original post; if you could please have a look and share any suggestions.
@dtam might just be the frequency? Try a bigger value like '5s'? The NaNs just mean there are no values in the given intervals.
I've tried with larger intervals but having the same issue. I went back to your example and tried the following line, to have a new 'sum' column with the result: df['sum'] = df.set_index(df['td'])['lenQueue'].resample('5ms').sum() This too gives me back all NaNs.
@dtam sounds like data issue. Does the specific column you are summing contain non-missing numbers? I.e. not all n/a values and dtype is int or float. I mean... aside from the frequency choice and the data type/values being aggregated, I don't really know what else could be the issue
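For reference, the all-NaN column in the question's edit is most likely an index-alignment issue rather than a data problem: assigning the resampled series (which has a TimedeltaIndex) back into the original frame (which has a plain integer index) aligns on index labels, finds no matches, and fills NaN everywhere. A minimal sketch of the effect, using the answer's small example:

```python
import pandas as pd

df = pd.DataFrame({"val": [11, 22, 33], "ms": [6938987, 6938997, 6938998]})
df["td"] = pd.to_timedelta(df["ms"], "ms")

resampled = df.set_index("td")["val"].resample("5ms").sum()

# Direct assignment aligns on index labels: resampled is indexed by
# timedelta, df by integers 0..2, so no label matches -> every value NaN.
df["sum"] = resampled

# Keeping the bucket totals as their own frame avoids the mismatch.
stats = resampled.reset_index(name="sum")
```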
