Implementing a More Efficient Python Algorithm for Substrings in Lists of Strings

Question

I'm doing the below in Python

I have a problem that involves checking a large list(size n) of input strings to see if the substrings contain any words from a large dictionary(size m).

I've looked around for efficient algorithms for this problem and found these: https://github.com/laurentluce/python-algorithms/blob/master/algorithms/string_matching.py

the Rabin-Karp and KMP matching algorithms.(note that I've replaced the ord() function for the Rabin-Karp with a dictionary for efficiency)

However, these actually perform slower than using an 'in' operation in Python which uses the Boyer–Moore–Horspool algorithm. I suppose that this is because the contains() method invoked by 'in' was is implemented in C.

How can I override this method with the Rabin-Karp for the string class in Python in C?

alexis · Accepted Answer · 2014-12-15 10:05:04Z

2

You could have a look at cython:

http://docs.cython.org/src/quickstart/cythonize.html

I find it easier to write some custom code for a specific operation than overriding a core python structure.

answered Dec 15, 2014 at 10:05

alexis

3712 silver badges8 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

holdenweb Over a year ago

And of course it might be worth trying PyPy, which offers significant speedups against standard CPython

alexis Over a year ago

Thx for the update, I didn't use CPython in a long time, but a quick google search shows things moved quite a bit. PyPy seems the way to go !

holdenweb · Accepted Answer · 2014-12-15 10:01:18Z

1

You can't, I am sorry to say.

The built-in types, being constructed at the time the interpreter is compiled, cannot be patched at run-time. If speed is really so important then you might want to write a C extension type that subclasses the built-in string type but with a different contains method.

answered Dec 15, 2014 at 10:01

holdenweb

37.8k7 gold badges62 silver badges80 bronze badges

Collectives™ on Stack Overflow

Implementing a More Efficient Python Algorithm for Substrings in Lists of Strings

2 Answers 2

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related