
You'll find snippets with reproducible input and an example of desired output at the end of the question.

The challenge:

I have a dataframe like this:

            column_A  column_B
dates                         
2017-08-04         1         1
2017-08-05         1         1
2017-08-06         1         1
2017-08-07         0         1
2017-08-08         0         1
2017-08-09         0         0
2017-08-10         1         0
2017-08-11         0         0
2017-08-12         0         1
2017-08-13         1         1
2017-08-14         1         0
2017-08-15         1         0

The dataframe has two columns with patterns of 1 and 0 like this:

(image of a pattern, e.g. a run like 1, 1, 1)

Or this:

(image of a shorter pattern, e.g. a run like 1, 1)

The number of columns will vary, and so will the length of the patterns. However, the only numbers in the dataframe will be 0 or 1.

I would like to identify these patterns, count each occurrence of them, and build a dataframe containing the results. To simplify the whole thing, I'd like to focus on the ones and ignore the zeros. The desired output in this particular case would be:

         column_A  column_B
pattern                    
5               0         1
3               2         0
2               0         1
1               1         0

I'd like the procedure to identify that, as an example, the pattern [1,1,1] occurs two times in column_A, and not at all in column_B. Notice that I've used the sums of the patterns as indexes in the dataframe.
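To make the task concrete, the run lengths can be computed outside pandas with `itertools.groupby` (a hypothetical helper for illustration only, not part of the question):

```python
from itertools import groupby

def run_sums(values):
    """Return the sum (= length) of each run of consecutive 1s."""
    return [sum(group) for key, group in groupby(values) if key == 1]

print(run_sums([1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1]))  # column_A -> [3, 1, 3]
print(run_sums([1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0]))  # column_B -> [5, 2]
```

Counting how often each length occurs in these lists gives exactly the rows of the desired output.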

Reproducible input:

import pandas as pd
df = pd.DataFrame({'column_A':[1,1,1,0,0,0,1,0,0,1,1,1],
                   'column_B':[1,1,1,1,1,0,0,0,1,1,0,0]})

colnames = list(df)
df[colnames] = df[colnames].apply(pd.to_numeric)
datelist = pd.date_range(pd.Timestamp.today().strftime('%Y-%m-%d'), periods=len(df)).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
print(df)

Desired output:

df2 = pd.DataFrame({'pattern':[5,3,2,1],
                    'column_A':[0,2,0,1],
                    'column_B':[1,0,1,0]})
df2 = df2.set_index(['pattern'])
print(df2)

My attempts so far:

I've been working on a solution with nested for loops, where I calculate running sums that reset each time an observation equals zero, combined with calls such as df.apply(lambda x: x.value_counts()). But it's messy, to say the least, and so far not 100% correct.
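The reset-at-zero running sum described above can be written without nested loops, using the common run-labelling trick of `cumsum` over value changes (a sketch of the idea, not the asker's actual code):

```python
import pandas as pd

s = pd.Series([1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1])  # column_A from the question

run_id = s.ne(s.shift()).cumsum()      # new label each time the value changes
running = s.groupby(run_id).cumsum()   # running sum that resets at every 0
print(running.tolist())                # [1, 2, 3, 0, 0, 0, 1, 0, 0, 1, 2, 3]

# The last element of each run of 1s equals the run's length:
ends = running[(s == 1) & (s.shift(-1) != 1)]
print(ends.value_counts().to_dict())   # {3: 2, 1: 1}
```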

Thank you for any other suggestions!


1 Answer

Here's my attempt:

def fun(ser):
    ser = ser.dropna()            # keep only the cumulative sums at run ends
    ser = ser.diff().fillna(ser)  # differences between run ends = run lengths
    return ser.value_counts()     # count how often each run length occurs


df.cumsum().where((df == 1) & (df != df.shift(-1))).apply(fun)
Out: 
     column_A  column_B
1.0       1.0       NaN
2.0       NaN       1.0
3.0       2.0       NaN
5.0       NaN       1.0

The first part (df.cumsum().where((df == 1) & (df != df.shift(-1)))) keeps the cumulative sum only at the last 1 of each run:

            column_A  column_B
dates                         
2017-08-04       NaN       NaN
2017-08-05       NaN       NaN
2017-08-06       3.0       NaN
2017-08-07       NaN       NaN
2017-08-08       NaN       5.0
2017-08-09       NaN       NaN
2017-08-10       4.0       NaN
2017-08-11       NaN       NaN
2017-08-12       NaN       NaN
2017-08-13       NaN       7.0
2017-08-14       NaN       NaN
2017-08-15       7.0       NaN

So if we ignore the NaNs and take the diffs, we recover the run lengths. That's what the function does: it drops the NaNs and then takes the differences, so the values are no longer cumulative sums but the lengths of the individual runs. Finally, it returns the value counts.
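The answer stops at the value-counts frame with NaNs. One possible final step (my addition, not part of the answer) to match the desired df2 exactly is to fill the NaNs with 0 and sort the index descending:

```python
import pandas as pd

df = pd.DataFrame({'column_A': [1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1],
                   'column_B': [1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0]})

def fun(ser):
    ser = ser.dropna()            # keep only the cumulative sums at run ends
    ser = ser.diff().fillna(ser)  # differences between run ends = run lengths
    return ser.value_counts()     # count how often each run length occurs

result = df.cumsum().where((df == 1) & (df != df.shift(-1))).apply(fun)
result = result.fillna(0).astype(int).sort_index(ascending=False)
result.index = result.index.astype(int).rename('pattern')
print(result)
```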


1 Comment

Thank you for finding the time to explain the details in the solution as well!
