6

For very large strings (spanning multiple lines) is it faster to use Python's built-in string search or to split the large string (perhaps on \n) and iteratively search the smaller strings?

E.g., for very large strings:

for l in get_mother_of_all_strings().split('\n'):
    if 'target' in l:
        return True
return False

or

return 'target' in get_mother_of_all_strings()
3 Comments
  • 9
    get_mother_of_all_strings is such a funny name for a function haha Commented Aug 5, 2011 at 22:30
  • 2
    There's probably a Boyer-Moore or similar algorithm under the covers, which just loves going through big data. Use the full string. Commented Aug 5, 2011 at 22:42
  • This is easy to test by yourself, generate a very long string that doesn't contain the string you're looking for (worst case scenario) like 'a\n' * 100000000 and compare the execution times. Commented Aug 5, 2011 at 22:45
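
A minimal sketch of that worst-case test (the string size, repeat count, and the use of any() are arbitrary choices, not from the original comment):

import timeit

# Worst case: 'target' never occurs, so every character has to be scanned.
setup = "text = 'a\\n' * 1_000_000"

whole = timeit.timeit("'target' in text", setup=setup, number=10)
split = timeit.timeit(
    "any('target' in line for line in text.split('\\n'))",
    setup=setup,
    number=10,
)
print(f"whole string: {whole:.3f}s, split lines: {split:.3f}s")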

6 Answers

14

Certainly the second. I don't see any advantage in doing many searches over small strings instead of one search over the big string. You may skip some chars thanks to the shorter lines, but the split operation has its costs too (searching for \n, creating n different strings, creating the list), and the loop is done in Python.

The string __contains__ method is implemented in C and is therefore noticeably faster than looping in Python.

Also consider that the second method stops as soon as the first match is found, while the first one splits the whole string before it even starts searching inside it.

This is easy to verify with a simple benchmark:

import timeit

prepare = """
with open('bible.txt') as fh:
    text = fh.read()
"""

presplit_prepare = """
with open('bible.txt') as fh:
    text = fh.read()
lines = text.split('\\n')
"""

longsearch = """
'hello' in text
"""

splitsearch = """
for line in text.split('\\n'):
    if 'hello' in line:
        break
"""

presplitsearch = """
for line in lines:
    if 'hello' in line:
        break
"""


benchmark = timeit.Timer(longsearch, prepare)
print("IN on big string takes:", benchmark.timeit(1000), "seconds")

benchmark = timeit.Timer(splitsearch, prepare)
print("IN on split string takes:", benchmark.timeit(1000), "seconds")

benchmark = timeit.Timer(presplitsearch, presplit_prepare)
print("IN on pre-split string takes:", benchmark.timeit(1000), "seconds")

The result is:

IN on big string takes: 4.27126097679 seconds
IN on split string takes: 35.9622690678 seconds
IN on pre-split string takes: 11.815297842 seconds

The bible.txt file is actually the Bible; I found it here: http://patriot.net/~bmcgin/kjvpage.html (text version)



1
import timeit

a = #a really long string with multiple lines and target near the end

timeit.timeit(stmt='["target" in x for x in a.split("\\n")]', setup='a = """%s"""'%a)
23.499058284211792
timeit.timeit(stmt='"target" in a', setup='a = """%s"""'%a)
5.2557157624293325

So the large string is MUCH faster to search through than a split list of smaller ones.
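
A self-contained variant of the same measurement, for readers who want to reproduce it (the filler text, string size, and repeat count here are arbitrary assumptions):

import timeit

# Hypothetical test data: many short lines, with 'target' near the end.
a = "some filler line without the word\n" * 200_000 + "finally the target\n"

split_time = timeit.timeit(
    '["target" in x for x in a.split("\\n")]',
    globals={"a": a},
    number=100,
)
whole_time = timeit.timeit('"target" in a', globals={"a": a}, number=100)

print("split search:", split_time)
print("whole-string search:", whole_time)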


1

The second one is a lot faster; here is some measurement data:

def get_mother_of_all_strings():
    return "abcdefg\nhijklmnopqr\nstuvwxyz\naatargetbb"

first: 2.00
second: 0.26
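
The timing harness is not shown; something along these lines would produce comparable numbers (the repeat count is an arbitrary assumption):

import timeit

def get_mother_of_all_strings():
    return "abcdefg\nhijklmnopqr\nstuvwxyz\naatargetbb"

def first():
    # split, then search line by line
    for l in get_mother_of_all_strings().split('\n'):
        if 'target' in l:
            return True
    return False

def second():
    # search the whole string directly
    return 'target' in get_mother_of_all_strings()

print("first: ", timeit.timeit(first, number=1_000_000))
print("second:", timeit.timeit(second, number=1_000_000))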


1

If you only need to match once, to see whether the substring is in the string at all, then both methods scan roughly the same data, but splitting adds the overhead of building the list and looping line by line, so searching the large string directly is a bit faster.

If you have to do multiple matches, then I would tokenize the string once, put the tokens into a set (or dict), and keep that in memory:

s = 'SOME REALLY LONG STRING'
tokens = set(s.split())
return substring in tokens
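
Note that this only finds whole whitespace-delimited tokens, not arbitrary substrings; the payoff is that, after the one-time split, every lookup is an average O(1) set membership test. A small usage sketch (the example string is illustrative):

s = 'SOME REALLY LONG STRING'
tokens = set(s.split())      # built once

print('LONG' in tokens)      # True: exact token
print('LON' in tokens)       # False, even though 'LON' in s is True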


0

A for loop in Python is slow, and splitting a large string is also slow, so searching the large string directly is much faster.


0

The second way is faster: splitting adds another O(n) pass to find the delimiters and split, plus memory allocation for each sublist, and then close to O(n^2) work to iterate over the sublists and search each one, while searching the bigger string is just O(n).

2 Comments

Your complexity arguments are fallacious... it takes more O(n) iterations, but it never becomes an O(n^2) operation because the initial data set is the same.
@GaretJax, I see, but anyway, the conclusion is true.
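
One quick way to check the scaling claim empirically is to time both approaches at two input sizes and compare how the times grow (a sketch; the sizes are arbitrary and the target is absent, so the whole input is scanned each time):

import timeit

for n in (100_000, 1_000_000):
    setup = f"text = 'some filler line\\n' * {n}"
    whole = timeit.timeit("'target' in text", setup=setup, number=10)
    split = timeit.timeit(
        "any('target' in line for line in text.split('\\n'))",
        setup=setup,
        number=10,
    )
    print(f"n={n}: whole={whole:.3f}s, split={split:.3f}s")
# Both times grow roughly linearly with n; the split version just has a
# much larger constant factor, so neither approach is O(n^2).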
