2

I wrote a function that is given 2 strings, s and a and it is supposed to return position of the first occurrence of a in s. It works fine with a as one character, but otherwise it stops working if the first occurrence is after 3rd character in s.

I've already checked if mul and add work, if hash of a is correct and I've reduced bases to 10 and 100 (which are not very good for hashing cause they're not prime) and it worked (on strings of length of 20). This might mean that modulo doesn't work as expected.

function getIndex(s, a) {

    // 2 bases and 2 mods to reduce the number of collisions

    var base1 = 31337;
    var base2 = 31357;

    var mod1 = 1e9 + 7;
    var mod2 = 1e9 + 9;

    //suffix arrays

    var hs1 = new Uint32Array(s.length);
    var hs2 = new Uint32Array(s.length);

    // bases to the power of a.length

    var ba1 = 1;
    var ba2 = 1;

    // hashes for a

    var ha1 = 0;
    var ha2 = 0;

    //operators

    var mul = (x, y, mod) => (x * y) % mod;

    var add = (x, y, mod) => {
        x += y;
        if(x >= mod) x -= mod;
        if(x < 0) x += mod;
        return x;
    }

    //get hash of a and find value of ba1 and ba2

    for(var i = 0; i < a.length; i++) {
        ha1 = add(mul(ha1, base1, mod1), a.charCodeAt(i), mod1);
        ha2 = add(mul(ha2, base2, mod2), a.charCodeAt(i), mod2);

        ba1 = mul(ba1, base1, mod1);
        ba2 = mul(ba2, base2, mod2);
    }

    //make suffix array

    var h1 = 0;
    var h2 = 0;

    for(var i = 0; i < s.length; i++) {
        h1 = add(mul(h1, base1, mod1), s.charCodeAt(i), mod1);
        h2 = add(mul(h2, base2, mod2), s.charCodeAt(i), mod2);

        hs1[i] = h1;
        hs2[i] = h2;
    }

    //Compare hashes of substrings of s (by removing prefix from the current element) with hash of a

    for(var i = a.length - 1; i < s.length; i++) {
        var h1 = hs1[i];
        var h2 = hs2[i];
        if(i >= a.length) {
            console.log(i, i - a.length, h1);
            h1 = add(h1, -mul(hs1[i - a.length], ba1, mod1), mod1);
            h2 = add(h2, -mul(hs2[i - a.length], ba2, mod2), mod2);
        }
        if(h1 == ha1 && h2 == ha2) return i - a.length + 1;
    }

    return -1;
}
getIndex("abcdefgh", "f") //returns 5
getIndex("abcdefgh", "fg")//returns -1
5
  • Can you please explain what is the purpose of your hash function? What does it have to do with the task to retrieve the first index where a substring occurs in another string? This could be simply done with s.indexOf(a) as you might know. Commented Sep 5, 2019 at 14:18
  • Cause in this example, the complexity of indexOf would be (n - m) * m (or simplified O(n * m) ), but with hashing (in this code it's purpose is to turn strings to number) and suffix array I think the complexity will be O(n) (1 for loop from 0 to m and 2 for loops from 0 to n). Commented Sep 5, 2019 at 21:04
  • I haven't explained what was it's purpose, using hashing I make a suffix array of numbers which represent suffixes of the string. Due to the properties of this hash function (if I take (x - len)th elem from suffix array, multiply it by base to power of len and subtract from xth element, I'll get the hash of substring from (x - len)th element to xth element) I can get hash of every substring very fast and compare it to the hash of the string I am looking for (which is comparing numbers and should also be pretty fast). Commented Sep 5, 2019 at 21:14
  • Ok, so you want to compare hashed strings. I thought that your function should still compare plain text strings. Commented Sep 6, 2019 at 5:37
  • Your "suffix array" seems to store the hashes of prefixes? Commented Sep 8, 2019 at 1:54

1 Answer 1

1

With a bit more logging, it becomes clear that you're victim to floating point inaccuracies. JS numbers can only represent 52 bit of integers precisely.

In getIndex("abcdefgh", "fg"), the searched hash ha1 is 3196477, your base multiplier ba1 is 982007569, and in the sixth iteration the prefix hashes are hs1[6] = 73644174 and hs1[4] = 800389532. If you put those in your arithmetic function, the result is 3196441, off by 6. This is because 800389532 * 982007569 is about two orders of magnitude larger than the maximum safe integer and in JS does evaluate to 785988578572367700, which is clearly wrong.

You would need to either choose your modulus and base smaller, or use bigints for all your calculations.

Sign up to request clarification or add additional context in comments.

9 Comments

I've changed line 28 to var mul = (x, y, mod) => Number((BigInt(x) * BigInt(y)) % BigInt(mod)); and now it works, but I don't know how fast BigInt is. I found out that it's size is only limited by memory, so I am not sure about how fast it is (mathematical operations and conversion).
I also don't think it would be smart to reduce bases and mods cause it would make space for more collisions, matching hashes of strings that are not same.
@Franx1024 You have to deal with collisions anyway, they're always bound to happen. When finding a possible solution with the hashes, you always need to confirm it by actually comparing the strings.
@Franx1024 Regarding performance, bigints a probably quite slow. They would however allow you to use only a single, big hash instead of two small hashes with different bases and modulos, so that you can get rid of all that code duplication. And yes, then you'll have to benchmark against indexOf. I'm pretty certain that JS engines did implement a string search algorithm that does not have the naive quadratic runtime.
I found this O(nm) complexity of indexof here stackoverflow.com/questions/12752274/… , but when I took a look at source code ( chromium.googlesource.com/v8/v8/+/4.3.49/src/… ), it seemed like they're using some sort of suffix array too.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.