1

I'll apologize now because this questions sounds stupid in my head and I am probably overlooking something very glaringly obvious. Anyway...

Okay so I am teaching myself scala and as a learning exercise I decided to implement a method that determines if one string contains another, smaller string. The first thing I did was use the naive version where I go to each letter of the string and start checking forward to see if each character matches. That went fine. Then I decided to implement a more efficient method and this is what I came up with (no special cases included):

// return true if a is a substring of b
def is_sub(a: String, b: String) : Boolean = {
  for(i <- 0 until b.length-a.length) { // O(n-m)
    if(a.hashCode == b.substring(i,a.length+i).hashCode) return true // O(1) + O(1) + O(1)
  }
  return false
}

First, can someone check my work and make sure I am right. I am prone to stupid mistakes. Second, can you check to make sure my time complexities are accurate. Assuming the first 2 are what I think they are why isn't this method mentioned on the wikipedia page for string searching algorithms? In theory this should be O(n-m) with no preprocessing space needed.

Where have I screwed up my analysis of this problem?

5
  • 1
    Did you consider the time complexity of hashing? Commented May 7, 2013 at 17:13
  • @alex23 - Substring is O(1) in java/scala. <at>Diode - I thought the time complexity of hashing was constant time but I guess I was wrong. Commented May 7, 2013 at 18:18
  • 1
    @user439299 - Substring was O(1) in java, but as of Java 7 update 6, the preference is to make a copy - in O(n). See mail.openjdk.java.net/pipermail/core-libs-dev/2012-May/… . Commented May 7, 2013 at 18:37
  • Gotcha, yes. thank you. I am still on java 6 now though so guess this doesn't apply to me yet (although it is good to know for when I do update) Commented May 7, 2013 at 18:50
  • Interesting stuff to think about, anyway. You can get around the O(n) substring problem by defining your hash function to just operate on the original string, and passing in the offset and length. As pointed out by others, an O(1) hash function is likely to have collisions, so when you get a matching hash, you have to do a brute force equality check to confirm the potential match. Basically, the better your hash function, the fewer collisions, but the hash function will also necessarily be more complex. Commented May 7, 2013 at 20:43

2 Answers 2

4

The code that you have posted is not guaranteed to be correct. If two strings are equal, then their hash codes must be the same, but the converse is not necessarily true. It is possible to find pairs of strings that are different strings, but which have the same hash code. Consequently, your function may return an incorrect answer if you find a substring with the same hash code as the string to search for.

Additionally, your complexity analysis is a bit incorrect. It takes time O(k) to compute a hash code for a string of length k (assuming you have a halfway decent hash function!), so this means that that on each iteration of the loop you will be doing O(n) work computing the hash code for the substring you've taken. Since you do this O(m) times, the total time complexity is O(mn), not O(m - n).

However, what you are doing is closely related to the Rabin-Karp string searching algorithm, which is indeed based on hashing strings. In order to avoid doing O(n) work on each iteration, the algorithm uses a rolling hash function that can be easily updated from one substring to the next in time O(1). It also has an extra check in place so that if the current hash code matches the hash code for the substring, the algorithm actually checks each character to make sure they match. This algorithm takes time O(mn) in the worst case, but in the average case is much, much faster (time O(m + n)).

Hope this helps!

Sign up to request clarification or add additional context in comments.

4 Comments

+1 - Better answer than mine, covering correctness and similarity to Rabin-Karp.
@templatetypedef - Looking at this again... Am I wrong in saying that the method I present is guaranteed to return true when the substring is present? (It may return false or true if the substring is not present due to hashing collisions). Also Hashing algorithms can be done in constant time (take n random letters; convert to number; plug in to hash function) but may be more prone to collisions. I was definitely wrong, and true string comparison must be done in constant time but still interesting to think about.
@user439299- Your function is guaranteed to return true if the substring exists, but it might return true even if it's not. As for hashing - picking n random characters and plugging them into a hash function won't work unless you pick the n characters consistently across all strings, and if n is small this is an extremely poor hash function.
Ah yes, sorry I should have said "arbitrary" characters not random ones. Also, as Andy Thomas-Cramer, points out substring is no longer O(1) as I thought it was. Holes being poked in my function left and right! haha
0

Your algorithm is still O(n m).

Let m be the length of the pattern, and n be the length of the searched string.

At each of (n - m) positions in the searched string, you're making a substring and computing its hashcode. Each of these requires iterating over m characters.

The complexity depends not only on the code you've written, but also the code you've called.

2 Comments

Hashing can be constant time. Substring is constant time. The issue (as templatetypedef points out) is that the function - which always returns true when the substring is present - is not guarenteed to return false when the substring is not present.
@user439299 - Although hashing in general can be constant time, the definition for String is O(n). Substring was O(1) in java, but as of Java 7 update 6, the preference is to make a copy - in O(n). See mail.openjdk.java.net/pipermail/core-libs-dev/2012-May/… . Errors lie in both correctness and analysis.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.