Performance of using regex matched groups in pandas dataframe

Question

I have a pandas series of ~350k rows, and I want to apply the pandas.Series.str.extract function using a regular expression consisting of ~100 substrings, such as:

The extract is too slow: it takes 1 minute in my Jupyter notebook (Python 3.9). Why is it so slow, and how can I speed it up?

Edit 1 I used 'itemX' as an example, but it can be substituted by any substring. The regular expression could be something like

'(carrageenan|dihydro|basketball|etc...)'

Edit 2 Answer to some comments:

I'm looking for exact matches
I already precompile the regex using re.compile()

I used 'itemX' as an example, but it can be any substring. The regular expression could be something like '(carrageenan|dihydro|basketball|etc...)'
– Brainless, Commented Jun 27, 2021 at 18:10
str.contains does too. If we check others, they also probably do. @SeaBean
– Mustafa Aydın, Commented Jun 27, 2021 at 19:16

Wiktor Stribiżew · Accepted Answer · 2021-06-27 19:22:00Z

5

In most cases, searching for many words is slow because many of the search words share the same prefix: the more such words there are in the list, the more backtracking steps are required to find a match, which slows down execution.

A regex trie will come to the rescue here, together with word boundaries (since you need an exact match). Install the package with pip install trieregex and use

from trieregex import TrieRegEx
keywords = ['item0','item1','item2','item3']
tr = TrieRegEx(*keywords)
pattern = fr'\b({tr.regex()})\b'

Then, you can use the pattern with the .str.extract() method.
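As a minimal sketch of that last step, the snippet below passes a trie-style pattern to Series.str.extract. The sample strings are made up for illustration, and the pattern is written out by hand in the shape a regex trie would have for the four keywords, so the snippet does not depend on trieregex being installed.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
s = pd.Series(["buy item2 now", "nothing here", "item0 and item3"])

# Hand-written stand-in for a trie-generated pattern: the shared prefix
# "item" appears once, followed by the alternatives that differ.
pattern = r"\b(item(?:0|1|2|3))\b"

# expand=False returns a Series when the pattern has a single capture group;
# rows without a match become NaN, and only the first match per row is kept.
first_match = s.str.extract(pattern, expand=False)
print(first_match)
```

Only the outer parentheses form a capture group here; the inner (?:...) group is non-capturing, so the result still has one column.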

If you do not want to use a third-party library to generate the regex trie, you can use the code from this SO post.
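For readers who want to see the idea without any extra package, here is a small standard-library sketch of a trie-based pattern builder. It is not the code from the linked post, and the helper name build_trie_regex is made up for this illustration; it only shows how shared prefixes can be factored out of a plain alternation.

```python
import re

def build_trie_regex(words):
    """Return a regex source string matching any of `words`,
    with common prefixes factored out (hypothetical helper)."""
    # Build a character trie; the empty-string key marks the end of a word.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}

    def to_regex(node):
        is_word_end = "" in node
        branches = [
            re.escape(ch) + to_regex(child)
            for ch, child in sorted(node.items())
            if ch != ""
        ]
        if not branches:
            return ""
        if len(branches) == 1 and not is_word_end:
            return branches[0]
        body = "(?:" + "|".join(branches) + ")"
        # If a word ends at this node, whatever follows is optional.
        return body + "?" if is_word_end else body

    return to_regex(trie)

keywords = ["item0", "item1", "item2", "item3"]
trie_source = build_trie_regex(keywords)
print(trie_source)

# Word boundaries enforce an exact match, as in the answer above.
pattern = re.compile(r"\b(" + trie_source + r")\b")
print(pattern.search("buy item2 now").group(1))
```

With a plain alternation such as item0|item1|item2|item3, the engine may retry the prefix "item" for each alternative; the trie form matches the prefix once and then branches only on the characters that differ.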

2 Comments

Jérôme Richard Over a year ago

Interesting, but isn't it the purpose of the regexp compilation/engine to do such work? I know that some engines generate a minimal deterministic automaton, like RE and generally not the standard PCRE engine, but this simplification seems like a basic one for a regexp engine...

Wiktor Stribiżew Over a year ago

@JérômeRichard The re regex engine does not do this under the hood.
