Efficient string search algorithm?

Question

I'll apologize now because this questions sounds stupid in my head and I am probably overlooking something very glaringly obvious. Anyway...

Okay so I am teaching myself scala and as a learning exercise I decided to implement a method that determines if one string contains another, smaller string. The first thing I did was use the naive version where I go to each letter of the string and start checking forward to see if each character matches. That went fine. Then I decided to implement a more efficient method and this is what I came up with (no special cases included):

// return true if a is a substring of b
def is_sub(a: String, b: String) : Boolean = {
  for(i <- 0 until b.length-a.length) { // O(n-m)
    if(a.hashCode == b.substring(i,a.length+i).hashCode) return true // O(1) + O(1) + O(1)
  }
  return false
}

First, can someone check my work and make sure I am right. I am prone to stupid mistakes. Second, can you check to make sure my time complexities are accurate. Assuming the first 2 are what I think they are why isn't this method mentioned on the wikipedia page for string searching algorithms? In theory this should be O(n-m) with no preprocessing space needed.

Where have I screwed up my analysis of this problem?

@alex23 - Substring is O(1) in java/scala. <at>Diode - I thought the time complexity of hashing was constant time but I guess I was wrong. — user439299
– user439299, Commented May 7, 2013 at 18:18
@user439299 - Substring was O(1) in java, but as of Java 7 update 6, the preference is to make a copy - in O(n). See mail.openjdk.java.net/pipermail/core-libs-dev/2012-May/… . — Andy Thomas
– Andy Thomas, Commented May 7, 2013 at 18:37
Gotcha, yes. thank you. I am still on java 6 now though so guess this doesn't apply to me yet (although it is good to know for when I do update) — user439299
– user439299, Commented May 7, 2013 at 18:50
Interesting stuff to think about, anyway. You can get around the O(n) substring problem by defining your hash function to just operate on the original string, and passing in the offset and length. As pointed out by others, an O(1) hash function is likely to have collisions, so when you get a matching hash, you have to do a brute force equality check to confirm the potential match. Basically, the better your hash function, the fewer collisions, but the hash function will also necessarily be more complex. — Markku K.
– Markku K., Commented May 7, 2013 at 20:43

templatetypedef · Accepted Answer · 2013-05-07 17:19:12Z

4

The code that you have posted is not guaranteed to be correct. If two strings are equal, then their hash codes must be the same, but the converse is not necessarily true. It is possible to find pairs of strings that are different strings, but which have the same hash code. Consequently, your function may return an incorrect answer if you find a substring with the same hash code as the string to search for.

Additionally, your complexity analysis is a bit incorrect. It takes time O(k) to compute a hash code for a string of length k (assuming you have a halfway decent hash function!), so this means that that on each iteration of the loop you will be doing O(n) work computing the hash code for the substring you've taken. Since you do this O(m) times, the total time complexity is O(mn), not O(m - n).

However, what you are doing is closely related to the Rabin-Karp string searching algorithm, which is indeed based on hashing strings. In order to avoid doing O(n) work on each iteration, the algorithm uses a rolling hash function that can be easily updated from one substring to the next in time O(1). It also has an extra check in place so that if the current hash code matches the hash code for the substring, the algorithm actually checks each character to make sure they match. This algorithm takes time O(mn) in the worst case, but in the average case is much, much faster (time O(m + n)).

Hope this helps!

answered May 7, 2013 at 17:19

templatetypedef

376k112 gold badges954 silver badges1.1k bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Andy Thomas Over a year ago

+1 - Better answer than mine, covering correctness and similarity to Rabin-Karp.

user439299 Over a year ago

@templatetypedef - Looking at this again... Am I wrong in saying that the method I present is guaranteed to return true when the substring is present? (It may return false or true if the substring is not present due to hashing collisions). Also Hashing algorithms can be done in constant time (take n random letters; convert to number; plug in to hash function) but may be more prone to collisions. I was definitely wrong, and true string comparison must be done in constant time but still interesting to think about.

templatetypedef Over a year ago

@user439299- Your function is guaranteed to return true if the substring exists, but it might return true even if it's not. As for hashing - picking n random characters and plugging them into a hash function won't work unless you pick the n characters consistently across all strings, and if n is small this is an extremely poor hash function.

user439299 Over a year ago

Ah yes, sorry I should have said "arbitrary" characters not random ones. Also, as Andy Thomas-Cramer, points out substring is no longer O(1) as I thought it was. Holes being poked in my function left and right! haha

Andy Thomas · Accepted Answer · 2013-05-07 17:21:08Z

0

Your algorithm is still O(n m).

Let m be the length of the pattern, and n be the length of the searched string.

At each of (n - m) positions in the searched string, you're making a substring and computing its hashcode. Each of these requires iterating over m characters.

The complexity depends not only on the code you've written, but also the code you've called.

edited May 7, 2013 at 17:21

answered May 7, 2013 at 17:16

Andy Thomas

86.8k12 gold badges111 silver badges160 bronze badges

2 Comments

user439299 Over a year ago

Hashing can be constant time. Substring is constant time. The issue (as templatetypedef points out) is that the function - which always returns true when the substring is present - is not guarenteed to return false when the substring is not present.

Andy Thomas Over a year ago

@user439299 - Although hashing in general can be constant time, the definition for String is O(n). Substring was O(1) in java, but as of Java 7 update 6, the preference is to make a copy - in O(n). See mail.openjdk.java.net/pipermail/core-libs-dev/2012-May/… . Errors lie in both correctness and analysis.

Collectives™ on Stack Overflow

Efficient string search algorithm?

2 Answers 2

4 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related