
I would like to conditionally select rows in a pandas dataframe if a string value contains some other string values, defined as a regex. The string values to check against change per row and are currently stored in a series; both formats are shown below:

import pandas as pd

df = pd.DataFrame(["a", "a", "b", "c", "de", "de"], columns=["Value"])

df:

| Index | Value |
|-------|-------|
| 0     | "a"   |
| 1     | "a"   |
| 2     | "b"   |
| 3     | "c"   |
| 4     | "de"  |
| 5     | "de"  |

series = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"])

Series with the contains-regex per row:

| Index | Value   |
|-------|---------|
| 0     | "a|b|c" |
| 1     | "a"     |
| 2     | "d|e"   |
| 3     | "c"     |
| 4     | "c|a"   |
| 5     | "f|e"   |

The expected output I want is a boolean mask that I can use to index the dataframe down to only the rows whose value matches the corresponding regex:

mask = [True, True, False, True, False, True] 

df[mask]:

| Index | Value |
|-------|-------|
| 0     | "a"   |
| 1     | "a"   |
| 3     | "c"   |
| 5     | "de"  |

I would like to avoid lambdas and apply as much as possible, since I am processing a big dataset and I need execution to be as performant as possible.
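
For clarity, the behaviour I am describing could be written naively with one re.search per row (shown only to illustrate the expected mask, not as the performant solution I am after):

import re

# illustration only: evaluate each row's regex against the corresponding value
mask = [re.search(pattern, value) is not None
        for pattern, value in zip(series, df["Value"])]
# mask == [True, True, False, True, False, True]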

Thanks a lot,

1 Comment

Just to be sure: you want "de" against "f|e" to be True? Commented May 30, 2022 at 12:48

3 Answers


Well, I will let you timeit the code below:

First, concat the "regex" series to the original DF:

df = pd.DataFrame(["a", "a", "b", "c", "de", "de"], columns=["value"])
regex = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"], name="regex")
df = pd.concat([df, regex], axis=1)
df

Result:

  value  regex
0     a  a|b|c
1     a      a
2     b    d|e
3     c      c
4    de    c|a
5    de    f|e
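
As a side note, the same column can be added without concat by assigning the series directly; a minimal equivalent with the same variable names:

# equivalent to the concat above: attach the regex series as a new column
df["regex"] = regex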

Then, create and apply your "mask":

# character-level check: True if any character of the row's "regex" string
# (including "|") also appears in the row's "value"
mask = df.apply(lambda r: any(s in r['value'] for s in r['regex']), axis=1)
df2 = df[mask].drop('regex', axis=1)
df2

Result:

  value
0     a
1     a
3     c
5    de
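
If you want to time it yourself, a minimal sketch with the standard timeit module (the exact numbers will of course depend on your machine and data size):

import timeit

# time the mask construction on the concatenated frame
t = timeit.timeit(
    lambda: df.apply(lambda r: any(s in r['value'] for s in r['regex']), axis=1),
    number=1000,
)
print(t)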



If the regexes are all different, as in the example, there is no way around it: each regex must be applied to its own Value, which means a minimum complexity of O(n).

Maybe the fastest solution is the following (still using apply):

import re

df['re'] = series
# note: this uses the row's Value as the pattern and matches it against the regex string
df[['Value', 're']].T.apply(lambda x: bool(re.match(x.Value, x.re)))

output: 
0     True
1     True
2    False
3     True
4    False
5    False
dtype: bool
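
For the output the asker expects (where index 5 is True because "de" contains "e"), a small variant that searches the row's regex inside its value could look like this (just a sketch, same column names assumed):

import re

# search the row's regex inside its value, instead of matching the value against the regex string
df[['Value', 're']].T.apply(lambda x: bool(re.search(x.re, x.Value)))
# 0 True, 1 True, 2 False, 3 True, 4 False, 5 True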

1 Comment

Given his output example, index 5 should be "True".
def common_char(string1, string2):
    # True if any character of string1 also appears in string2
    for ch in string1:
        if ch in string2:
            return True
    return False


df = pd.DataFrame(["a", "a", "b", "c", "de", "de"])
series = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"])


true_false_list = []
# Initial values are added only to make the dtype of the "index" column integer,
# then the dummy row is dropped again.
resulted_data = pd.DataFrame({"index": [0], "value": ["0"]}).drop(0)

for i in range(df.shape[0]):
    if common_char(df[0][i], series[i]):
        true_false_list.append(True)
        # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
        resulted_data = pd.concat(
            [resulted_data, pd.DataFrame({"index": [i], "value": [df[0][i]]})]
        )
    else:
        true_false_list.append(False)


output:

  index value
0     0     a
0     1     a
0     3     c
0     5    de
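
The same character-overlap check can also be done without the explicit loop and the repeated concatenation, e.g. with a plain list comprehension (just a sketch, mask and filtered are illustrative names, not benchmarked):

# build the boolean mask in one pass, then filter the original frame with it
mask = [common_char(value, pattern) for value, pattern in zip(df[0], series)]
filtered = df[mask]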
