
I have a data series with an irregularly spaced datetime column as my index, a numeric value, and three columns that each indicate whether a safety mechanism is activated to block the numeric value. For example:

DateTime         Safe1    Safe2    Safe3    Measurement
1/8/2013 6:06     N        Y        N
1/8/2013 6:23     N        Y        N
1/8/2013 6:40     N        N        N        28
1/8/2013 6:57     N        N        N        31

I need to resample the data with pandas to create clean half-hour interval data, taking the mean of the values where any exist. Of course, this drops the three safety string columns.

However, I would like to include a column that indicates Y if any of the safety mechanisms was activated at any point during the half-hour interval.

How do I get a string column in the resampled data that shows Y whenever a Y was present among the three safety mechanism columns in the raw data, with no Measurement value for that interval?

Desired output based upon the above:

DateTime       Safe1  Measurement
1/8/2013 6:00  Y
1/8/2013 6:30  N      29.5

3 Answers


I don't think it's possible to do what you want with the resample function alone, as there isn't much customisation you can do there. We have to use a TimeGrouper with a groupby operation instead.

First, let's create the data:

import pandas as pd

index = ['1/8/2013 6:06', '1/8/2013 6:23', '1/8/2013 6:40', '1/8/2013 6:57']
data = {'Safe1': ['N', 'N', 'N', 'N'],
        'Safe2': ['Y', 'Y', 'N', 'N'],
        'Safe3': ['N', 'N', 'N', 'N'],
        'Measurement': [0, 0, 28, 31]}

df = pd.DataFrame(index=index, data=data)
df.index = pd.to_datetime(df.index)
df

output:

                     Measurement Safe1 Safe2 Safe3
2013-01-08 06:06:00            0     N     Y     N
2013-01-08 06:23:00            0     N     Y     N
2013-01-08 06:40:00           28     N     N     N
2013-01-08 06:57:00           31     N     N     N

Then let's add a helper column, called Safe, that is a concatenation of all the SafeX columns. If there's at least one Y in the Safe column, we'll know that a safety mechanism was activated during that interval.

df['Safe'] = df['Safe1'] + df['Safe2'] + df['Safe3']
print(df)

output:

                     Measurement Safe1 Safe2 Safe3 Safe
2013-01-08 06:06:00            0     N     Y     N  NYN
2013-01-08 06:23:00            0     N     Y     N  NYN
2013-01-08 06:40:00           28     N     N     N  NNN
2013-01-08 06:57:00           31     N     N     N  NNN

Finally, we define a custom function that returns Y if there's at least one Y in the strings passed to it.

That custom function is applied to the Safe column after we have grouped the data into 30-minute intervals:

def func(x):
    x = ''.join(x.values)
    return 'Y' if 'Y' in x else 'N'

df.groupby(pd.TimeGrouper(freq='30Min')).agg({'Measurement': 'mean', 'Safe': func })

output:

                    Safe  Measurement
2013-01-08 06:00:00    Y          0.0
2013-01-08 06:30:00    N         29.5
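
If the blocked readings are stored as NaN (as in the question's sample) rather than 0, the same aggregation gives NaN for the first half hour, which matches the desired output where Measurement is blank. A minimal sketch, reusing the df and func defined above:

import numpy as np

# Sketch: blocked readings as NaN instead of 0; the 06:00 interval then has no
# Measurement mean, while Safe still comes out as 'Y'.
df['Measurement'] = [np.nan, np.nan, 28, 31]
df.groupby(pd.TimeGrouper(freq='30Min')).agg({'Measurement': 'mean', 'Safe': func})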

1 Comment

pd.TimeGrouper was renamed pd.Grouper... and in my case the agg-column names had to exist beforehand (compare stackoverflow.com/questions/60229375/…). Other than that it still works like a charm :)
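
For reference, a minimal sketch of the same aggregation using pd.Grouper, which replaced pd.TimeGrouper in current pandas (behaviour otherwise unchanged):

# Same aggregation, but with pd.Grouper instead of the removed pd.TimeGrouper.
df.groupby(pd.Grouper(freq='30min')).agg({'Measurement': 'mean', 'Safe': func})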

Here's an answer using pandas' built-in resample function.

First combine the 3 Safe values into a single column:

df['Safe'] = df.Safe1 + df.Safe2 + df.Safe3

Turn the 3-letter strings into a 0-1 variable:

df.Safe = df.Safe.apply(lambda x: 1 if 'Y' in x else 0)

Write a custom resampling function for the Safe column:

def f(x):
    if sum(x) > 0:
        return 'Y'
    else:
        return 'N'

Finally, resample:

df.resample('30T').Safe.agg({'Safe': f}).join(df.resample('30T').Measurement.mean())

Output:

                    Safe  Measurement 
2013-01-08 06:00:00  Y        0.0
2013-01-08 06:30:00  N        29.5
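
Note that the dict-renaming form of Series.agg used above was later deprecated and removed from pandas, and the 'T' offset alias is giving way to 'min'. A rough sketch of an equivalent for current versions, reusing the same f and the 0/1 Safe column:

# Sketch for recent pandas: one resample over the whole frame, giving each
# column its own aggregation; no renaming dict needed.
df.resample('30min').agg({'Safe': f, 'Measurement': 'mean'})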



I manually resample the dates (easy if it is just rounding).

Here is an example:

from datetime import datetime, timedelta
from random import randint, seed

import pandas as pd
from tabulate import tabulate

def df_to_md(df):
    print(tabulate(df, tablefmt="pipe",headers="keys"))

seed(42)

people=['tom','dick','harry']
avg_score=[90,50,10]
date_times=[n for n in pd.date_range(datetime.now()-timedelta(days=2),datetime.now(),freq='5 min').values]
scale=1+int(len(date_times)/len(people))
score =[randint(i,100)*i/10000 for i in avg_score*scale]

df=pd.DataFrame.from_records(list(zip(date_times,people*scale,score)),columns=['When','Who','Status'])
# Create 3 records tom should score 90%, dick 50% and poor harry only 10% 
# Tom should score well
df_to_md(df[df.Who=='tom'].head())

The table is in Markdown format - just to ease my cut and paste....

|    | When                       | Who   |   Status |
|---:|:---------------------------|:------|---------:|
|  0 | 2019-06-18 14:07:17.457124 | tom   |    0.9   |
|  3 | 2019-06-18 14:22:17.457124 | tom   |    0.846 |
|  6 | 2019-06-18 14:37:17.457124 | tom   |    0.828 |
|  9 | 2019-06-18 14:52:17.457124 | tom   |    0.9   |
| 12 | 2019-06-18 15:07:17.457124 | tom   |    0.819 |

Harry scores badly

df_to_md(df[df.Who=='harry'].head())
|    | When                       | Who   |   Status |
|---:|:---------------------------|:------|---------:|
|  2 | 2019-06-18 14:17:17.457124 | harry |    0.013 |
|  5 | 2019-06-18 14:32:17.457124 | harry |    0.038 |
|  8 | 2019-06-18 14:47:17.457124 | harry |    0.023 |
| 11 | 2019-06-18 15:02:17.457124 | harry |    0.079 |
| 14 | 2019-06-18 15:17:17.457124 | harry |    0.064 |

Let's get the average per hour per person.

def round_to_hour(t):
    # Rounds to nearest hour by adding a timedelta hour if minute >= 30
    return (t.replace(second=0, microsecond=0, minute=0, hour=t.hour)
               +timedelta(hours=t.minute//30))

And generate a new column using this method.

df['WhenRounded']=df.When.apply(lambda x: round_to_hour(x))
df_to_md(df[df.Who=='tom'].head())

This should be tom's data - showing original and rounded.

|    | When                       | Who   |   Status | WhenRounded         |
|---:|:---------------------------|:------|---------:|:--------------------|
|  0 | 2019-06-18 14:07:17.457124 | tom   |    0.9   | 2019-06-18 14:00:00 |
|  3 | 2019-06-18 14:22:17.457124 | tom   |    0.846 | 2019-06-18 14:00:00 |
|  6 | 2019-06-18 14:37:17.457124 | tom   |    0.828 | 2019-06-18 15:00:00 |
|  9 | 2019-06-18 14:52:17.457124 | tom   |    0.9   | 2019-06-18 15:00:00 |
| 12 | 2019-06-18 15:07:17.457124 | tom   |    0.819 | 2019-06-18 15:00:00 |
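
As an aside, pandas' built-in dt.round can do the same nearest-hour rounding without a helper function; a sketch, noting that exact half-hour ties are broken slightly differently, which does not matter for these timestamps:

# Sketch: vectorised nearest-hour rounding with the dt accessor.
# 'H' = hourly; pandas 2.2+ prefers the lowercase 'h' alias.
df['WhenRounded'] = df['When'].dt.round('H')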

We can resample by grouping and using an aggregation function.

Group by the rounded date and the person (a datetime and a str object); in this case we want the mean value, but other aggregations are also available.


df_resampled = (df.groupby(['WhenRounded', 'Who'])
                  .agg({'Status': 'mean'})
                  .reset_index())
# Output in Markdown format
df_to_md(df_resampled[df_resampled.Who == 'tom'].head())
|    | WhenRounded         | Who   |   Status |
|---:|:--------------------|:------|---------:|
|  2 | 2019-06-18 14:00:00 | tom   |  0.873   |
|  5 | 2019-06-18 15:00:00 | tom   |  0.83925 |
|  8 | 2019-06-18 16:00:00 | tom   |  0.86175 |
| 11 | 2019-06-18 17:00:00 | tom   |  0.84375 |
| 14 | 2019-06-18 18:00:00 | tom   |  0.8505  |

Let's check the mean for tom @ 14:00 - only the 14:07 and 14:22 readings round down into that hour:

print("Check tom 14:00 0.873 ... {:6.5f}".format((.900 + .846) / 2))
Check tom 14:00 0.873 ... 0.87300
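
For comparison, a sketch of the same hourly mean per person using pd.Grouper directly on the raw When column. Note that Grouper bins by flooring to the hour rather than rounding to the nearest one, so the :37 and :52 readings land in the earlier bucket and the numbers differ from the rounded table above:

# Sketch: hourly mean per person with pd.Grouper (floors to the hour,
# unlike the round_to_hour approach above).
df_floor = (df.groupby([pd.Grouper(key='When', freq='H'), 'Who'])['Status']
              .mean()
              .reset_index())
df_to_md(df_floor[df_floor.Who == 'tom'].head())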

Hope this assists

