5

I'd like to find a particular pattern in a pandas dataframe column, and return the corresponding index values in order to subset the dataframe.

Here's a sample dataframe with a possible pattern:

Snippet to produce dataframe:

import pandas as pd
import numpy as np

Observations = 10
Columns = 2
np.random.seed(123)
df = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
                  columns = ['ColA','ColB'])
datelist = pd.date_range(pd.datetime(2017, 7, 7).strftime('%Y-%m-%d'),
                         periods=Observations).tolist()
df['Dates'] = datelist
df = df.set_index(['Dates'])

pattern = [100,90,105]
print(df)

Dataframe:

            ColA  ColB
Dates                 
2017-07-07   103    92
2017-07-08    92    96
2017-07-09   107   109
2017-07-10   100    91
2017-07-11    90   107
2017-07-12   105    99
2017-07-13    90   104
2017-07-14    90   105
2017-07-15   109   104
2017-07-16    94    90

Here, the pattern of interest occurs in Column A on the dates 2017-07-10 to 2017-07-12, and that's what I'd like to end up with:

Desired output:

2017-07-10   100    91
2017-07-11    90   107
2017-07-12   105    99

If the same pattern occurs several times, I would like to subset the dataframe the same way, and also count how many times the pattern occurs, but I hope that's more straight forward as long as I get the first step sorted out.

Thank you for any suggestions!

5 Answers 5

7

Using the magic of list comprehensions:

[df.index[i - len(pattern)] # Get the datetime index 
 for i in range(len(pattern), len(df)) # For each 3 consequent elements 
 if all(df['ColA'][i-len(pattern):i] == pattern)] # If the pattern matched 

# [Timestamp('2017-07-10 00:00:00')]
Sign up to request clarification or add additional context in comments.

1 Comment

This is by far the most elegant solution.
6

Here is a solution:

Check if the pattern was found in any of the columns using rolling. This will give you the last index of the group matching the pattern

matched = df.rolling(len(pattern)).apply(lambda x: all(np.equal(x, pattern)))
matched = matched.sum(axis = 1).astype(bool)   #Sum to perform boolean OR

matched
Out[129]: 
Dates
2017-07-07    False
2017-07-08    False
2017-07-09    False
2017-07-10    False
2017-07-11    False
2017-07-12     True
2017-07-13    False
2017-07-14    False
2017-07-15    False
2017-07-16    False
dtype: bool

For each match, add the indexes of the complete pattern:

idx_matched = np.where(matched)[0]
subset = [range(match-len(pattern)+1, match+1) for match in idx_matched]

Get all the patterns:

result = pd.concat([df.iloc[subs,:] for subs in subset], axis = 0)

result
Out[128]: 
            ColA  ColB
Dates                 
2017-07-10   100    91
2017-07-11    90   107
2017-07-12   105    99

Comments

1

The shortest way is finding the index at which the pattern starts. Then you just need to select the three following rows.

In order to find these indexes, a one-liner is enough:

indexes=df[(df.ColA==pattern[0])&(df["ColA"].shift(-1)==pattern[1])&(df["ColA"].shift(-2)==pattern[2])].index

Then do as the other answer says to get the subsets that you want.

Comments

1
for col in df:
    index = df[col][(df[col] == pattern[0]) & (df[col].shift(-1) == pattern[1]) & (df[col].shift(-2) == pattern[2])].index
    if not index.empty: print(index)

Comments

1

While Deena's answer is elegant, it is horribly slow for longer sequences.

Flab's solution on the other hand is possibly faster, but it only works for numeric values.

Therefore, I want to propose a different solution, that is fast and works for all data-types. For the following I assume the data-type to be a string. If it is not in your case, you should convert it first.

rolling_cat = lambda s, n: pd.Series(zip(*[s.shift(-x) for x in range(n)])).str.join(",")

indices = df.loc[rolling_cat(df["ColA"], n=len(pattern) == (",".join(pattern))].index

Here, I define a function that concatenates every the next n entries to each entry in the column. This means, that each row now contains the potential pattern in the form of a string

Then I compare these patterns to a string representation of the pattern by concatenating the each entry of the pattern.

The reason why this is much faster than Deena's approach is, that string comparison is much faster than comparing each item in a list.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.