
I have a file with the following format:

SET, 0, 0, 0, 6938987, 0, 4
SET, 1, 1, 6938997, 128, 0, 0
SET, 2, 4, 6938998, 145, 0, 2
SET, 0, 9, 6938998, 147, 0, 0
SET, 1, 11, 6938998, 149, 0, 0
....
SET, 1, 30, 6946103, 6, 0, 0
SET, 2, 30, 6946104, 6, 0, 2
GET, 0, 30, 6946104, 8, 0, 0
SET, 1, 30, 6946105, 8, 0, 0
GET, 2, 30, 6946106, 7, 0, 0

The 5th column holds milliseconds that I measure from a system (converted from Java's System.nanoTime()), so the values don't correspond to any date/time format. I want to aggregate over intervals of 5 s: for example, from the first value 6938987 to 6943987, get the value counts of SET/GET, the averages, the standard deviations, and so on.

I've tried using data.resample in various ways but keep getting the following error:

import pandas as pd

data = pd.read_csv('data2.log', sep=", ", header=None, engine="python")
data.columns = ["command", "server", "lenQueue", "inQueue", "diffQueue", "diffParse", "diffProcess"]
r = data.resample("5ms", on='inQueue')

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'

Is there any way to resample on a plain numeric column instead of a time series?

Edit - solution suggested by JohnE:

Transformed the ms into a timedelta, then resampled at 5 ms:

data['td'] = pd.to_timedelta(data['inQueue'], 'ms')
data['sum'] = data.set_index(data['td'])['lenQueue'].resample('5ms').sum()

[Other columns omitted]
                   td  sum  
0            00:00:00  NaN  
1     01:55:38.997000  NaN  
2     01:55:38.998000  NaN  
3     01:55:38.998000  NaN  
4     01:55:38.998000  NaN  
5     01:55:38.998000  NaN  
6     01:55:38.999000  NaN  

Could it be because the other columns must also be aggregated somehow? If so, how can I apply multiple aggregations?

1 Answer


The error message is telling you that you need to convert to a datetime-like format, so you need to do that!

A fairly easy way is to convert to a timedelta rather than timestamp, which you can do as follows. First let's use a simpler version of your data:

In [143]: df
Out[143]: 
   val       ms       
0   11  6938987
1   22  6938997
2   33  6938998

Then make a new column "td" that represents the timedelta in milliseconds, "ms". (If you wanted microseconds, use "us" instead):

In [144]: df['td'] = pd.to_timedelta(df['ms'], 'ms')

In [145]: df
Out[145]: 
   val       ms              td
0   11  6938987 01:55:38.987000
1   22  6938997 01:55:38.997000
2   33  6938998 01:55:38.998000

Then you can easily use resample. Note that you need to follow resample with some operation (e.g. sum, max, mean, etc.). Here I'll go with sum:

In [146]: df.set_index(df['td'])['val'].resample('5ms').sum()
Out[146]: 
td
01:55:38.987000    11.0
01:55:38.992000     NaN
01:55:38.997000    55.0
Freq: 5L, Name: val, dtype: float64
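To cover the rest of the original question (counts of SET/GET plus several statistics per 5-second bucket), the same timedelta trick extends to multiple aggregations via .agg() and to per-command counts via groupby before resampling. A minimal sketch with a small made-up frame (column names assumed from the question):

```python
import pandas as pd

# Hypothetical frame mirroring a few of the question's columns
data = pd.DataFrame({
    "command":  ["SET", "SET", "GET", "SET", "GET"],
    "lenQueue": [0, 1, 4, 9, 11],
    "inQueue":  [6938987, 6938997, 6943998, 6943999, 6944001],  # ms
})

# Millisecond counter -> timedelta, so resample accepts it
data["td"] = pd.to_timedelta(data["inQueue"], "ms")

# 5-second buckets, several aggregations at once
stats = (data.set_index("td")["lenQueue"]
             .resample("5s")
             .agg(["count", "mean", "std"]))

# Per-command (SET vs GET) counts within each 5-second bucket
counts = (data.set_index("td")
              .groupby("command")["lenQueue"]
              .resample("5s")
              .count())
```

Here stats has one row per 5-second bucket with count/mean/std columns, and counts is indexed by (command, bucket), which gives the SET/GET value counts the question asked for.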

4 Comments

That makes sense, thanks! I've tried to make it work with my dataset, but am only getting NaNs as the aggregation result. I've updated my original post; if you could please have a look and share any suggestions.
@dtam might just be the frequency? Try a bigger value like '5s'? The NaNs just mean there are no values in the given intervals.
I've tried with larger intervals but having the same issue. I went back to your example and tried the following line, to have a new 'sum' column with the result: df['sum'] = df.set_index(df['td'])['lenQueue'].resample('5ms').sum() This too gives me back all NaNs.
@dtam sounds like data issue. Does the specific column you are summing contain non-missing numbers? I.e. not all n/a values and dtype is int or float. I mean... aside from the frequency choice and the data type/values being aggregated, I don't really know what else could be the issue
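For reference, the all-NaN column in the question's edit is most likely an index-alignment issue rather than a data problem: assigning the resampled series (which has a TimedeltaIndex) back into the original frame (which has a plain integer index) aligns on index labels, finds no matches, and fills NaN everywhere. A minimal sketch of the effect, using the answer's small example:

```python
import pandas as pd

df = pd.DataFrame({"val": [11, 22, 33], "ms": [6938987, 6938997, 6938998]})
df["td"] = pd.to_timedelta(df["ms"], "ms")

resampled = df.set_index("td")["val"].resample("5ms").sum()

# Direct assignment aligns on index labels: resampled is indexed by
# timedelta, df by integers 0..2, so no label matches -> every value NaN.
df["sum"] = resampled

# Keeping the bucket totals as their own frame avoids the mismatch.
stats = resampled.reset_index(name="sum")
```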
