Pandas: How to find a particular pattern in a dataframe column?

Question

I'd like to find a particular pattern in a pandas dataframe column, and return the corresponding index values in order to subset the dataframe.

Here's a sample dataframe with a possible pattern:

Snippet to produce dataframe:

import pandas as pd
import numpy as np

Observations = 10
Columns = 2
np.random.seed(123)
df = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
                  columns = ['ColA','ColB'])
datelist = pd.date_range(pd.datetime(2017, 7, 7).strftime('%Y-%m-%d'),
                         periods=Observations).tolist()
df['Dates'] = datelist
df = df.set_index(['Dates'])

pattern = [100,90,105]
print(df)

Dataframe:

            ColA  ColB
Dates                 
2017-07-07   103    92
2017-07-08    92    96
2017-07-09   107   109
2017-07-10   100    91
2017-07-11    90   107
2017-07-12   105    99
2017-07-13    90   104
2017-07-14    90   105
2017-07-15   109   104
2017-07-16    94    90

Here, the pattern of interest occurs in Column A on the dates 2017-07-10 to 2017-07-12, and that's what I'd like to end up with:

Desired output:

2017-07-10   100    91
2017-07-11    90   107
2017-07-12   105    99

If the same pattern occurs several times, I would like to subset the dataframe the same way, and also count how many times the pattern occurs, but I hope that's more straight forward as long as I get the first step sorted out.

Thank you for any suggestions!

Deena · Accepted Answer · 2017-07-07 11:36:14Z

7

Using the magic of list comprehensions:

[df.index[i - len(pattern)] # Get the datetime index 
 for i in range(len(pattern), len(df)) # For each 3 consequent elements 
 if all(df['ColA'][i-len(pattern):i] == pattern)] # If the pattern matched 

# [Timestamp('2017-07-10 00:00:00')]

edited Jul 7, 2017 at 11:36

answered Jul 7, 2017 at 11:15

Deena

6,2236 gold badges36 silver badges41 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

baloo Over a year ago

This is by far the most elegant solution.

FLab · Accepted Answer · 2017-07-07 11:08:23Z

Here is a solution:

Check if the pattern was found in any of the columns using rolling. This will give you the last index of the group matching the pattern

matched = df.rolling(len(pattern)).apply(lambda x: all(np.equal(x, pattern)))
matched = matched.sum(axis = 1).astype(bool)   #Sum to perform boolean OR

matched
Out[129]: 
Dates
2017-07-07    False
2017-07-08    False
2017-07-09    False
2017-07-10    False
2017-07-11    False
2017-07-12     True
2017-07-13    False
2017-07-14    False
2017-07-15    False
2017-07-16    False
dtype: bool

For each match, add the indexes of the complete pattern:

idx_matched = np.where(matched)[0]
subset = [range(match-len(pattern)+1, match+1) for match in idx_matched]

Get all the patterns:

result = pd.concat([df.iloc[subs,:] for subs in subset], axis = 0)

result
Out[128]: 
            ColA  ColB
Dates                 
2017-07-10   100    91
2017-07-11    90   107
2017-07-12   105    99

baloo · Accepted Answer · 2017-07-07 11:09:13Z

1

The shortest way is finding the index at which the pattern starts. Then you just need to select the three following rows.

In order to find these indexes, a one-liner is enough:

indexes=df[(df.ColA==pattern[0])&(df["ColA"].shift(-1)==pattern[1])&(df["ColA"].shift(-2)==pattern[2])].index

Then do as the other answer says to get the subsets that you want.

answered Jul 7, 2017 at 11:09

baloo

5271 gold badge5 silver badges13 bronze badges

Comments

SayPy · Accepted Answer · 2017-07-07 11:11:30Z

1

for col in df:
    index = df[col][(df[col] == pattern[0]) & (df[col].shift(-1) == pattern[1]) & (df[col].shift(-2) == pattern[2])].index
    if not index.empty: print(index)

answered Jul 7, 2017 at 11:11

SayPy

5661 gold badge4 silver badges13 bronze badges

Comments

chillichief · Accepted Answer · 2021-10-20 20:07:41Z

While Deena's answer is elegant, it is horribly slow for longer sequences.

Flab's solution on the other hand is possibly faster, but it only works for numeric values.

Therefore, I want to propose a different solution, that is fast and works for all data-types. For the following I assume the data-type to be a string. If it is not in your case, you should convert it first.

rolling_cat = lambda s, n: pd.Series(zip(*[s.shift(-x) for x in range(n)])).str.join(",")

indices = df.loc[rolling_cat(df["ColA"], n=len(pattern) == (",".join(pattern))].index

Here, I define a function that concatenates every the next n entries to each entry in the column. This means, that each row now contains the potential pattern in the form of a string

Then I compare these patterns to a string representation of the pattern by concatenating the each entry of the pattern.

The reason why this is much faster than Deena's approach is, that string comparison is much faster than comparing each item in a list.

Collectives™ on Stack Overflow

Pandas: How to find a particular pattern in a dataframe column?

5 Answers 5

1 Comment

Comments

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

1 Comment

Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related