
I would like to conditionally select rows in a pandas dataframe if a string value contains some other string values, defined as a regex. The string values to check against change per row and are currently stored in a series; both formats are shown below:

import pandas as pd

df = pd.DataFrame(["a", "a", "b", "c", "de", "de"], columns=["Value"])

df:

| Index | Value |
|-------|-------|
| 0     | "a"   |
| 1     | "a"   |
| 2     | "b"   |
| 3     | "c"   |
| 4     | "de"  |
| 5     | "de"  |

series = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"])

Series with the contains-regex per row:

| Index | Value   |
|-------|---------|
| 0     | "a|b|c" |
| 1     | "a"     |
| 2     | "d|e"   |
| 3     | "c"     |
| 4     | "c|a"   |
| 5     | "f|e"   |

The expected output I want is a boolean mask that I can use to index the dataframe down to only the rows whose value matches the corresponding regex:

mask = [True, True, False, True, False, True] 

df[mask]:

| Index | Value |
|-------|-------|
| 0     | "a"   |
| 1     | "a"   |
| 3     | "c"   |
| 5     | "de"  |

I would like to avoid lambdas and apply as much as possible, since I am processing a big dataset and I need execution to be as performant as possible.
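
For clarity, the behaviour I am describing could be written naively with one re.search per row (shown only to illustrate the expected mask, not as the performant solution I am after):

import re

# illustration only: evaluate each row's regex against the corresponding value
mask = [re.search(pattern, value) is not None
        for pattern, value in zip(series, df["Value"])]
# mask == [True, True, False, True, False, True]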

Thanks a lot,

1 Comment

Just to be sure: you want "de" against "f|e" to be True? Commented May 30, 2022 at 12:48

3 Answers


Well, I will let you timeit the code below:

First, concat the "regex" series to the original DF:

df = pd.DataFrame(["a", "a", "b", "c", "de", "de"], columns=["value"])
regex = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"], name="regex")
df = pd.concat([df, regex], axis=1)
df

Result:

  value  regex
0     a  a|b|c
1     a      a
2     b    d|e
3     c      c
4    de    c|a
5    de    f|e
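
As a side note, the same column can be added without concat by assigning the series directly; a minimal equivalent with the same variable names:

# equivalent to the concat above: attach the regex series as a new column
df["regex"] = regex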

Then, create and apply your "mask":

# character-level check: True if any character of the row's "regex" string
# (including "|") also appears in the row's "value"
mask = df.apply(lambda r: any(s in r['value'] for s in r['regex']), axis=1)
df2 = df[mask].drop('regex', axis=1)
df2

Result:

  value
0     a
1     a
3     c
5    de
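
If you want to time it yourself, a minimal sketch with the standard timeit module (the exact numbers will of course depend on your machine and data size):

import timeit

# time the mask construction on the concatenated frame
t = timeit.timeit(
    lambda: df.apply(lambda r: any(s in r['value'] for s in r['regex']), axis=1),
    number=1000,
)
print(t)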



If the regexes are all different, as in the example, there is no way around it: each regex must be applied to its own Value, which means a minimum complexity of O(n).

Maybe the fastest solution is the following (still using apply):

import re

df['re'] = series
# note: this uses the row's Value as the pattern and matches it against the regex string
df[['Value', 're']].T.apply(lambda x: bool(re.match(x.Value, x.re)))

output: 
0     True
1     True
2    False
3     True
4    False
5    False
dtype: bool
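
For the output the asker expects (where index 5 is True because "de" contains "e"), a small variant that searches the row's regex inside its value could look like this (just a sketch, same column names assumed):

import re

# search the row's regex inside its value, instead of matching the value against the regex string
df[['Value', 're']].T.apply(lambda x: bool(re.search(x.re, x.Value)))
# 0 True, 1 True, 2 False, 3 True, 4 False, 5 True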

1 Comment

Given his output example, index 5 should be "True".
def common_char(string1, string2):
    # True if any character of string1 also appears in string2
    for ch in string1:
        if ch in string2:
            return True
    return False


df = pd.DataFrame(["a", "a", "b", "c", "de", "de"])
series = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"])


true_false_list = []
# Initial values are added only to make the dtype of the "index" column integer,
# then the dummy row is dropped again.
resulted_data = pd.DataFrame({"index": [0], "value": ["0"]}).drop(0)

for i in range(df.shape[0]):
    if common_char(df[0][i], series[i]):
        true_false_list.append(True)
        # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
        resulted_data = pd.concat(
            [resulted_data, pd.DataFrame({"index": [i], "value": [df[0][i]]})]
        )
    else:
        true_false_list.append(False)


output:

  index value
0     0     a
0     1     a
0     3     c
0     5    de
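
The same character-overlap check can also be done without the explicit loop and the repeated concatenation, e.g. with a plain list comprehension (just a sketch, mask and filtered are illustrative names, not benchmarked):

# build the boolean mask in one pass, then filter the original frame with it
mask = [common_char(value, pattern) for value, pattern in zip(df[0], series)]
filtered = df[mask]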
