
I have a dataframe that looks like this:

                     night  DSWRF_integ
ForecastTime
2018-05-12 00:00:00    1.0            1
2018-05-12 00:15:00    0.0            1
2018-05-12 00:30:00    0.0            1
2018-05-12 00:45:00    0.0            1
2018-05-12 01:00:00    0.0            0
2018-05-12 01:15:00    0.0            0
2018-05-12 01:30:00    0.0            0
2018-05-12 01:45:00    0.0            0
2018-05-12 02:00:00    0.0            0
2018-05-12 02:15:00    0.0            0
2018-05-12 02:30:00    0.0            0
2018-05-12 02:45:00    0.0            0
2018-05-12 03:00:00    0.0            0
2018-05-12 03:15:00    0.0            0
2018-05-12 03:30:00    0.0            0
2018-05-12 03:45:00    0.0            0
2018-05-12 04:00:00    0.0            0
2018-05-12 04:15:00    0.0            0
2018-05-12 04:30:00    0.0            0
2018-05-12 04:45:00    0.0            0
2018-05-12 05:00:00    0.0            0
2018-05-12 05:15:00    0.0            0
2018-05-12 05:30:00    0.0            0
2018-05-12 05:45:00    0.0            0
2018-05-12 06:00:00    0.0            0
2018-05-12 06:15:00    0.0            0
2018-05-12 06:30:00    0.0            0
2018-05-12 06:45:00    0.0            0
2018-05-12 07:00:00    0.0            0
2018-05-12 07:15:00    0.0            0
2018-05-12 07:30:00    0.0            0
2018-05-12 07:45:00    0.0            0
2018-05-12 08:00:00    0.0            0
2018-05-12 08:15:00    0.0            0
2018-05-12 08:30:00    0.0            0
2018-05-12 08:45:00    0.0            0
2018-05-12 09:00:00    0.0            0
2018-05-12 09:15:00    0.0            0
2018-05-12 09:30:00    0.0            0
2018-05-12 09:45:00    0.0            0
2018-05-12 10:00:00    0.0            0
2018-05-12 10:15:00    0.0            0
2018-05-12 10:30:00    0.0            0
2018-05-12 10:45:00    0.0            0
2018-05-12 11:00:00    0.0            0
2018-05-12 11:15:00    0.0            1
2018-05-12 11:30:00    0.0            1
2018-05-12 11:45:00    0.0            1

2018-05-12 12:00:00    0.0            0
2018-05-12 12:15:00    0.0            0
2018-05-12 12:30:00    0.0            0
2018-05-12 12:45:00    0.0            0
2018-05-12 13:00:00    0.0            0
2018-05-12 13:15:00    0.0            0
2018-05-12 13:30:00    0.0            0
2018-05-12 13:45:00    0.0            0

2018-05-12 14:00:00    1.0            1
2018-05-12 14:15:00    1.0            1
2018-05-12 14:30:00    1.0            1
2018-05-12 14:45:00    1.0            1
2018-05-12 15:00:00    1.0            1

I am trying to find a way, without iterating over the dataframe (which is too slow), to convert consecutive zeros in the column DSWRF_integ to ones, but only when the number of consecutive zeros is smaller than a given threshold (for example, threshold=10).

In this specific case, I would like to replace all the zeros in column DSWRF_integ with ones for the period 2018-05-12 12:00:00 to 2018-05-12 13:45:00, because the number of consecutive zeros there is smaller than 10.

The resulting dataframe should look like this:

                     night  DSWRF_integ
ForecastTime
2018-05-12 00:00:00    1.0            1
2018-05-12 00:15:00    0.0            1
2018-05-12 00:30:00    0.0            1
2018-05-12 00:45:00    0.0            1
2018-05-12 01:00:00    0.0            0
2018-05-12 01:15:00    0.0            0
2018-05-12 01:30:00    0.0            0
2018-05-12 01:45:00    0.0            0
2018-05-12 02:00:00    0.0            0
2018-05-12 02:15:00    0.0            0
2018-05-12 02:30:00    0.0            0
2018-05-12 02:45:00    0.0            0
2018-05-12 03:00:00    0.0            0
2018-05-12 03:15:00    0.0            0
2018-05-12 03:30:00    0.0            0
2018-05-12 03:45:00    0.0            0
2018-05-12 04:00:00    0.0            0
2018-05-12 04:15:00    0.0            0
2018-05-12 04:30:00    0.0            0
2018-05-12 04:45:00    0.0            0
2018-05-12 05:00:00    0.0            0
2018-05-12 05:15:00    0.0            0
2018-05-12 05:30:00    0.0            0
2018-05-12 05:45:00    0.0            0
2018-05-12 06:00:00    0.0            0
2018-05-12 06:15:00    0.0            0
2018-05-12 06:30:00    0.0            0
2018-05-12 06:45:00    0.0            0
2018-05-12 07:00:00    0.0            0
2018-05-12 07:15:00    0.0            0
2018-05-12 07:30:00    0.0            0
2018-05-12 07:45:00    0.0            0
2018-05-12 08:00:00    0.0            0
2018-05-12 08:15:00    0.0            0
2018-05-12 08:30:00    0.0            0
2018-05-12 08:45:00    0.0            0
2018-05-12 09:00:00    0.0            0
2018-05-12 09:15:00    0.0            0
2018-05-12 09:30:00    0.0            0
2018-05-12 09:45:00    0.0            0
2018-05-12 10:00:00    0.0            0
2018-05-12 10:15:00    0.0            0
2018-05-12 10:30:00    0.0            0
2018-05-12 10:45:00    0.0            0
2018-05-12 11:00:00    0.0            0
2018-05-12 11:15:00    0.0            1
2018-05-12 11:30:00    0.0            1
2018-05-12 11:45:00    0.0            1

2018-05-12 12:00:00    0.0            1
2018-05-12 12:15:00    0.0            1
2018-05-12 12:30:00    0.0            1
2018-05-12 12:45:00    0.0            1
2018-05-12 13:00:00    0.0            1
2018-05-12 13:15:00    0.0            1
2018-05-12 13:30:00    0.0            1
2018-05-12 13:45:00    0.0            1

2018-05-12 14:00:00    1.0            1
2018-05-12 14:15:00    1.0            1
2018-05-12 14:30:00    1.0            1
2018-05-12 14:45:00    1.0            1
2018-05-12 15:00:00    1.0            1

I have tried various approaches using auxiliary columns, but none of them has produced anything close to what I want. Any help would be highly appreciated :)

  • Can you show us what you have tried so far? That way we know what is "too slow" for you. Commented Feb 5, 2019 at 14:09
  • I have been trying by looping over the dataframe rows, both by using df.itertuples and df.iterrows and various conditional statements, but I am iterating through dataframes that have millions of rows so this approach is too slow. I have not kept any of the minimal examples I made, as I have been trying to achieve what I mention using logical indexing :) Commented Feb 5, 2019 at 14:14
  • try this Commented Feb 5, 2019 at 14:17
  • @Chris You should post this as an answer and link to it. Commented Feb 5, 2019 at 14:34
  • @IMCoins I do not really have the time at the moment. Feel free to post it as an answer. Commented Feb 5, 2019 at 14:37

1 Answer

You could do the following:

th = 3  # set threshold (the question's example would use 10)

# True for rows that are 0
x = df.DSWRF_integ.eq(0)

# Label each run of consecutive equal values: the cumulative sum increases
# by one at every row where x changes (where diff != 0)
g = x.astype(int).diff().fillna(0).ne(0).cumsum()

# Keep only the zero rows, group them by run label, and flag the rows whose
# run of consecutive zeroes is shorter than the threshold
ix = x[x].groupby(g[x]).transform('size').lt(th)

# Replace 0 with 1 in the flagged rows
df.loc[ix[ix].index, 'DSWRF_integ'] = 1
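
For reference, here is a minimal self-contained sketch of the same steps wrapped in a function and applied to data shaped like the question's. The function name fill_short_zero_runs and the reconstructed 15-minute sample are assumptions made for illustration, and the label-based assignment assumes the index has no duplicate labels:

```python
import pandas as pd


def fill_short_zero_runs(s: pd.Series, threshold: int) -> pd.Series:
    """Return a copy of s where runs of zeros shorter than threshold become 1.

    Hypothetical helper; assumes the index of s has no duplicate labels.
    """
    is_zero = s.eq(0)
    # The run label increases by one at every row where is_zero flips.
    run_id = is_zero.astype(int).diff().fillna(0).ne(0).cumsum()
    # For the zero rows only, the length of the run each row belongs to.
    run_len = is_zero[is_zero].groupby(run_id[is_zero]).transform('size')
    out = s.copy()
    out.loc[run_len[run_len.lt(threshold)].index] = 1
    return out


# Sample rebuilt from the question: 4 ones, 41 zeros, 3 ones, 8 zeros, 5 ones.
values = [1] * 4 + [0] * 41 + [1] * 3 + [0] * 8 + [1] * 5
idx = pd.date_range('2018-05-12 00:00', periods=len(values), freq='15min')
df = pd.DataFrame({'DSWRF_integ': values}, index=idx)

filled = fill_short_zero_runs(df['DSWRF_integ'], threshold=10)
print(filled.loc['2018-05-12 11:45':'2018-05-12 14:00'])
```

With a threshold of 10, only the eight zeros between 12:00 and 13:45 should be replaced; the run of 41 zeros between 01:00 and 11:00 is left alone.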

Example

I've created this sample dataframe to more easily check the resulting dataframe. I've also created a final dataframe with all intermediate pd.Series added to it for a better understanding of all steps:

import pandas as pd

df = pd.DataFrame({'col1':[0,0,0,2,1,3,0,1,2,0,0,0,0,1]})

Now, setting a threshold of 4, for instance, should turn all zeroes to 1 except those in rows 9 to 12, where the run of four zeroes is not shorter than the threshold:

result = df.copy()
th = 4
x = df.col1.eq(0)
g = x.astype(int).diff().fillna(0).ne(0).cumsum()
ix = x[x].groupby(g[x]).transform('size').lt(th) 
result.loc[ix[ix].index, 'col1'] = 1

df.assign(x=x, g=g, ix=ix, result=result)

     col1   x    g    ix     result
0      0   True  0   True       1
1      0   True  0   True       1
2      0   True  0   True       1
3      2  False  1    NaN       2
4      1  False  1    NaN       1
5      3  False  1    NaN       3
6      0   True  2   True       1
7      1  False  3    NaN       1
8      2  False  3    NaN       2
9      0   True  4  False       0
10     0   True  4  False       0
11     0   True  4  False       0
12     0   True  4  False       0
13     1  False  5    NaN       1
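
The NaN values in the ix column appear because ix is computed on the zero rows only. As a possible variant (a sketch, not the only way to write it), the run length can be computed for every row and combined with the zero mask, which avoids both the NaN values and the label-based assignment. The names is_zero, run_id, run_len and short_zero are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 0, 0, 2, 1, 3, 0, 1, 2, 0, 0, 0, 0, 1]})
th = 4

is_zero = df['col1'].eq(0)
# Label runs: the id increases whenever is_zero differs from the previous row.
run_id = is_zero.ne(is_zero.shift()).cumsum()
# Run length for every row, so non-zero rows get a number instead of NaN.
run_len = is_zero.groupby(run_id).transform('size')
# True only for zero rows that sit in a run shorter than the threshold.
short_zero = is_zero & run_len.lt(th)

# Replace the flagged rows with 1 and leave everything else untouched.
result = df['col1'].mask(short_zero, 1)
print(result.tolist())
# [1, 1, 1, 2, 1, 3, 1, 1, 2, 0, 0, 0, 0, 1]
```

Because the replacement uses a boolean mask rather than index labels, this version should also behave sensibly when the index contains duplicates.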

7 Comments

Can you explain your answer?
Yes @IMCoins, on it
Let me know if this helps @FanisSofianopoulos, or if you have any doubts
@yatu thank you very much for the detailed answer and effort. This is not exactly what I want, but I was in the process of trying to make it work. What I want is for zeros to be converted to 1 only when the number of consecutive zeros is smaller than the specified threshold. Your code does the opposite: it converts them when the number of consecutive zeros is higher than the threshold :)
@yatu I made a mistake while adapting your example to my case. I corrected that and it now works exactly as it should. Thank you so much. I will now try to understand exactly what it does :D