2

I have a pandas series of ~350k rows, and I want to apply the pandas.Series.str.extract function using a regular expression consisting of ~100 substrings, such as:

'(item0|item1|item2|item3|item4|item5|item6|item7|item8|item9|item10|item11|item12|item13|item14|item15|item16|item17|item18|item19|item20|item21|item22|item23|item24|item25|item26|item27|item28|item29|item30|item31|item32|item33|item34|item35|item36|item37|item38|item39|item40|item41|item42|item43|item44|item45|item46|item47|item48|item49|item50|item51|item52|item53|item54|item55|item56|item57|item58|item59|item60|item61|item62|item63|item64|item65|item66|item67|item68|item69|item70|item71|item72|item73|item74|item75|item76|item77|item78|item79|item80|item81|item82|item83|item84|item85|item86|item87|item88|item89|item90|item91|item92|item93|item94|item95|item96|item97|item98|item99|item100)'

The extract is too slow: it takes 1 minute in my jupyter notebook (Python 3.9). Why is it so slow and how to speed it up?

Edit 1 I used 'itemX' as an example, but it can be substituted by any substring. The regular expression could be something like

'(carrageenan|dihydro|basketball|etc...)'

Edit 2 Answer to some comments:

  • I'm looking for exact matches
  • I already precompile the regex using re.compile()
8
  • Why not use pattern r'(item\d+)' ? Commented Jun 27, 2021 at 18:02
  • I used 'itemX' as an example, but it can be any substring. The regular expression could be something like '(carrageenan|dihydro|basketball|etc...) Commented Jun 27, 2021 at 18:10
  • 1
    @SeaBean pandas does that. Commented Jun 27, 2021 at 19:04
  • 1
    str.contains does too. If we check others, they also probably do. @SeaBean Commented Jun 27, 2021 at 19:16
  • 1
    That's really nice! Much appreciated @MustafaAydın Commented Jun 27, 2021 at 19:20

1 Answer 1

5

In most cases, the problem with searching for multiple words is related to the fact that many of the search words share the same prefix, and the more such words are in the list, the more backtracking steps are required to find a match, which slows the code execution.

A regex trie will come to rescue here, together with word boundaries (since you need an exact match). Install pip install trieregex and use

from trieregex import TrieRegEx
keywords = ['item0','item1','item2','item3']
tr = TrieRegEx(*keywords)
pattern = fr'\b({tr.regex()})\b'

Then, you can use the pattern with .str.extract() method.

If you do not need to use some third party library to generate the regex trie, you can use the code from this SO post.

Sign up to request clarification or add additional context in comments.

2 Comments

Interesting, but isn't that the purpose of regexp compilation/engine to do such a work? I know that some engine generate a minimal deterministic automaton like RE and generally not the standard PCRE engine, but this simplification seems a basic one for a regexp engine...
@JérômeRichard The re regex engine does not do this under the hood.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.