5

I am currently trying to implement a hash function for my program in C. I have found many possible solutions but I do not understand them. The following is the hash function:

int hash(const char *word) {
    int hash = 0;
    int n;
    for (int i = 0; word[i] != '\0'; i++) {
        // alphabet case
        if (isalpha(word[i]))
            n = word[i] - 'a' + 1;
        else  // comma case
            n = 27;

        hash = ((hash << 3) + n) % SIZE;
    }
    return hash;
}

Why are we subtracting 'a'+1 from word[i]? Also, why are we doing the following: hash = ((hash << 3) + n) % SIZE?

1
  • 1
    "Why are we adding 'a'+1 to the string?" - presumably so n doesn't end up being 0. Commented Dec 9, 2013 at 4:03

4 Answers 4

2

Why are we adding 'a'+1 to the string?

We are not adding, we are subtracting. Moreover, we aren't doing it to the string, we do it to one character at a time.

Here is what it does, according to the authors's intentions: given a letter from a to z, the expression produces the sequence number of that letter: 'a' produces 1, 'b' produces 2, 'c' produces 3, and so on.

Unfortunately, this implementation is broken: when the letter is in upper case, isalpha returns true, but the result of the expression does not give you the letter number. In fact, if your computer uses encoding that is consistent with ASCII codes, the result would be a negative number.

why are we doing the following: hash = ((hash << 3) + n) % SIZE

This multiplies the prior value of hash by eight (shift by three is the same as multiplying by eight), adds the number of the letter, and then limits the value by obtaining the remainder of division by SIZE.

Since the actual value of the hash code is of very little interest, as long as it is sensitive to small changes in the word, you could use this function instead:

int hash (const char* word)
{
    unsigned int hash = 0;
    for (int i = 0 ; word[i] != '\0' ; i++)
    {
        hash = 31*hash + word[i];
    }
    return hash % SIZE;
}

This algorithm (without the SIZE limit) is used for calculating hash codes of Strings in Java. It is very simple and very efficient.

Sign up to request clarification or add additional context in comments.

3 Comments

"In fact, the result would be a negative number" -- it might or might not ... C doesn't mandate any particular encoding, so the uppercase letters might have larger values than the lowercase letters.
@JimBalter You're right, I didn't think about EBSDIC, where uppercase letters have higher codes than lowercase ones. Thanks!
i should have type size_t and it might be better to cast word[i] as unsigned char.
1
  • Why are we adding 'a'+1 to the string?

    if we don't add "+1", hash("a") = hash("aa") = has("aaa") ... check below code

    char alpha = 'a';
    printf("%d\n", alpha - 'a' + 1); // <= produces '1'
    
  • why are we doing the following: "hash = ((hash << 3) + n) % SIZE"?

    hash = ((hash * 8) + n ) % SIZE
    

Comments

1

Why are we adding 'a'+1 to the string?

We aren't ... - means subtract, not add, and word[i] is a character of the string, not the string. So we're subtracting 'a' and adding 1 to each character of the string.

If word[i] is a lower case letter, then word[i] - 'a' + 1 calculates the number of that letter: 'a' -> 1, ... 'z' -> 26. What if it isn't a lower case letter? Well, non-alphabetic characters (not just comma, contrary to the comment) are mapped to 27, but upper case letters, if present, result in undefined behavior.

"hash = ((hash << 3) + n) % SIZE"?

This multiplies the previous hash value by 8, then adds the value 1 ... 27 for the current character, and guarantees that the result doesn't exceed SIZE, which presumably is the number of hash buckets. If the string contains more characters than the word size / 3, the initial characters will be shifted out. If SIZE is a power of 2 and the string has more than SIZE/3 characters, then all of those additional characters will be shifted out.

That's how it works, but it's not a very good hash function. Aside from the code having an erroneous comment and not handling upper case letters, it also doesn't handle long strings well, because the initial characters will get shifted out, as mentioned. Also the shift and add operation combines adjacent characters in a non-random way, so it will produce more hash bucket collisions than the optimum. This hash function is fast, but there are better fast hash functions. See https://en.wikipedia.org/wiki/Hash_function for more information.

4 Comments

Well, comma is handled but nothing else is, so this has function assumes that the string it is hashing consists entirely of lower case letters and commas; if it contains anything else, the results are undefined. Eh? There are not any undefined cases that I see here. If the char is alphabetic, it gets mapped into the integers 1-26. If it is any other char, it gets hashed to 27. The comment says "comma case" but there is nothing in the code that treats commas specially.
@qwrrty You're right about the mismatch between the comma and the code, but you're wrong about there being no undefined cases ... uppercase letters, for which isalpha is true, do not get mapped to 1-26; what they do get mapped to is undefined, since C doesn't mandate any particular encoding.
@JimBalter Thanks for the correction on the uppercase letters, careless of me to miss that. KishB87: a good hash function will map its input strings fairly evenly to numbers across the hash space. Defining a function that will do this well is not simple, but it will at least gracefully handle all possible characters in its input, which this function does not.
There is indeed potential undefined behavior, but not where you describe it. If the char type is signed, isalpha(word[i]) has undefined behavior for negative char values. To avoid this problem, the argument to isalpha must be cast as unsigned char: isalpha((unsigned char)word[i]). There is more undefined behavior here: hash = ((hash << 3) + n) % SIZE: left shifting negative values is undefined behavior. You can have a negative value for hash if the first character is an upper-case letter. Change the type of hash and c to unsigned int to avoid this.
0

The subtraction is an attempt to convert the lowercase letters to numbers from 1 to 26. The comma is converted to 27 but the upper-case letters are converted to negative values (for the ASCII character set), which has bad side effects.

Indeed there is potential undefined behavior:

  • If the char type is signed, isalpha(word[i]) has undefined behavior for negative char values. To avoid this problem, the argument to isalpha must be cast as unsigned char: isalpha((unsigned char)word[i]).

  • hash = ((hash << 3) + n) % SIZE has potential undefined behavior too: left shifting negative values is undefined behavior. You can have a negative value for hash if the first character is an upper-case letter. Change the type of hash and c to unsigned int to avoid this.

The expression hash = ((hash << 3) + n) % SIZE is used to combine the bits of all characters into a value between 0 and SIZE-1. Note however that if SIZE is not an unsigned value, the expression might produce a negative value between -SIZE+1 and -1, which would probably have bad side effects.

Transcoding the character values does not really help producing a good hash function.

Here is an safer version:

#include <limits.h>

unsigned int hash(const char *word) {
    unsigned int hash = 0, c;

    for (size_t i = 0; word[i] != '\0'; i++) {
        c = (unsigned char)word[i];
        hash = (hash << 3) + (hash >> (sizeof(hash) * CHAR_BIT - 3)) + c;
    }
    return hash % SIZE;
}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.