I have a time series DataFrame containing only 1s and 0s (true/false). I wrote a function that loops through all rows with the value 1. Given a user-defined integer parameter called n_hold, it also sets the next n_hold rows after each such row to 1.

For example, in the DataFrame below the loop reaches row 2016-08-05. If n_hold = 2, then 2016-08-08 and 2016-08-09 are set to 1 as well:

2016-08-03    0
2016-08-04    0
2016-08-05    1
2016-08-08    0
2016-08-09    0
2016-08-10    0

The resulting df is then:

2016-08-03    0
2016-08-04    0
2016-08-05    1
2016-08-08    1
2016-08-09    1
2016-08-10    0

The problem is that this runs tens of thousands of times, and my current solution, which loops over the rows containing ones and subsets, is far too slow. Is there a really fast way to solve the problem above?

Here is my (slow) solution; x is the initial signal DataFrame:

n_hold = 2
entry_sig_diff = x.diff()
entry_sig_dt = entry_sig_diff[entry_sig_diff == 1].index
final_signal = x * 0
for i in range(0, len(entry_sig_dt)):
    row_idx = entry_sig_diff.index.get_loc(entry_sig_dt[i])

    if (row_idx + n_hold) >= len(x):
        break

    final_signal[row_idx:(row_idx + n_hold + 1)] = 1
  • Can you show your slow approach for completeness? Commented Dec 21, 2018 at 13:34
  • Is there a function in pandas that prints out the DataFrame I use, so I can post it here for you to use? I remember doing it once, but I forget whether it was in R or Python. Commented Dec 21, 2018 at 13:38

1 Answer

Answer completely changed, because consecutive 1 values need to be handled differently:

Explanation:

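The solution first uses where with a chained boolean mask to keep only the first 1 of each consecutive run: comparing with the shifted series via ne (not equal, !=) and also requiring x == 1 turns every other value into NaN. These NaNs are then forward filled by ffill with the limit parameter, and finally the remaining NaNs are replaced with 0: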

n_hold = 2
s = x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
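To make the chain easier to follow, here is a minimal sketch that splits it into steps on the example data from the question. The intermediate names (is_entry, entries_only, held) are only illustrative, and the index is rebuilt by hand from the dates shown in the question. The last step uses .fillna(0).astype(int) instead of downcast='int', on the assumption that newer pandas releases warn about the downcast argument:

```python
import pandas as pd

# Example data from the question; the index is rebuilt by hand.
idx = pd.to_datetime(["2016-08-03", "2016-08-04", "2016-08-05",
                      "2016-08-08", "2016-08-09", "2016-08-10"])
x = pd.Series([0, 0, 1, 0, 0, 0], index=idx)
n_hold = 2

# Step 1: True only on the first 1 of each run of consecutive 1s.
is_entry = x.ne(x.shift()) & (x == 1)

# Step 2: keep the entry rows, turn every other row into NaN.
entries_only = x.where(is_entry)

# Step 3: carry each entry forward over at most n_hold following rows.
held = entries_only.ffill(limit=n_hold)

# Step 4: rows that are still NaN were never held, so they become 0.
result = held.fillna(0).astype(int)
print(result)
```

On this input the result is 1 on 2016-08-05, 2016-08-08 and 2016-08-09, and 0 elsewhere, which matches the expected output in the question.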

Timings and comparing outputs:

import numpy as np
import pandas as pd

np.random.seed(123)
x = pd.Series(np.random.choice([0,1], p=(.8,.2), size=1000))
x1 = x.copy()
#print (x)


def orig(x):
    n_hold = 2
    entry_sig_diff = x.diff()
    entry_sig_dt = entry_sig_diff[entry_sig_diff == 1].index
    final_signal = x * 0
    for i in range(0, len(entry_sig_dt)):
        row_idx = entry_sig_diff.index.get_loc(entry_sig_dt[i])

        if (row_idx + n_hold) >= len(x):
            break

        final_signal[row_idx:(row_idx + n_hold + 1)] = 1
    return final_signal

#print (orig(x))

n_hold = 2
s = x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
#print (s)

df = pd.concat([x,orig(x1), s], axis=1, keys=('input', 'orig', 'new'))
print (df.head(20))
    input  orig  new
0       0     0    0
1       0     0    0
2       0     0    0
3       0     0    0
4       0     0    0
5       0     0    0
6       1     1    1
7       0     1    1
8       0     1    1
9       0     0    0
10      0     0    0
11      0     0    0
12      0     0    0
13      0     0    0
14      0     0    0
15      0     0    0
16      0     0    0
17      0     0    0
18      0     0    0
19      0     0    0

#check outputs
#print (s.values == orig(x).values)
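When checking the outputs, it may be worth looking at the edges of the series, where the two approaches do not seem to agree. The sketch below assumes a small hand-made input; loop_version and vectorized are hypothetical names, and loop_version is a port of the question's loop that uses .iloc so the slice is explicitly positional. The loop misses a 1 in the very first row (diff() is NaN there) and stops at an entry that is fewer than n_hold rows from the end, while the vectorized chain holds both:

```python
import pandas as pd


def loop_version(x, n_hold):
    # Port of the question's loop (hypothetical helper name).
    diff = x.diff()
    entries = diff[diff == 1].index
    out = x * 0
    for label in entries:
        pos = diff.index.get_loc(label)
        # Same early exit as the original: stop near the end of the series.
        if pos + n_hold >= len(x):
            break
        out.iloc[pos:pos + n_hold + 1] = 1
    return out


def vectorized(x, n_hold):
    # Same chain as the answer, with astype(int) instead of downcast='int'.
    is_entry = x.ne(x.shift()) & (x == 1)
    return x.where(is_entry).ffill(limit=n_hold).fillna(0).astype(int)


# A 1 in the first row and another 1 close to the end.
x = pd.Series([1, 0, 0, 0, 1, 0])
a = loop_version(x, 2)
b = vectorized(x, 2)
print(pd.concat([x, a, b], axis=1, keys=("input", "loop", "vectorized")))
```

Here the loop returns all zeros, while the vectorized version returns 1, 1, 1, 0, 1, 1. Which behaviour is wanted at the edges depends on the use case, so it is probably best to decide that before replacing the loop.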

Timings:

%timeit (orig(x))
24.8 ms ± 653 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
1.36 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
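%timeit is only available in IPython or Jupyter. Outside of those, a rough equivalent can be put together with the standard timeit module, as in the sketch below. The wrapper name vectorized and the repeat count of 200 are arbitrary choices, and the numbers will differ from the ones above depending on machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
x = pd.Series(np.random.choice([0, 1], p=(.8, .2), size=1000))
n_hold = 2


def vectorized():
    # Same chain as above, with astype(int) instead of downcast='int'.
    return x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0).astype(int)


n_runs = 200  # arbitrary repeat count for a rough estimate
elapsed = timeit.timeit(vectorized, number=n_runs)
per_call_ms = elapsed / n_runs * 1000
print(f"{per_call_ms:.3f} ms per call")
```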

3 Comments

Is this faster than a simple loop?
@user1234440 - I hope so, but it is best to test it. Or if you add your solution, I can add timings.
@user1234440 - Unfortunately the solution is a bit complicated; I added a new one, with an explanation and timings.
