Hash function for strings in C

Question

I am currently trying to implement a hash function for my program in C. I have found many possible solutions but I do not understand them. The following is the hash function:

int hash(const char *word) {
    int hash = 0;
    int n;
    for (int i = 0; word[i] != '\0'; i++) {
        // alphabet case
        if (isalpha(word[i]))
            n = word[i] - 'a' + 1;
        else  // comma case
            n = 27;

        hash = ((hash << 3) + n) % SIZE;
    }
    return hash;
}

Why are we subtracting 'a'+1 from word[i]? Also, why are we doing the following: hash = ((hash << 3) + n) % SIZE?

"Why are we adding 'a'+1 to the string?" - presumably so n doesn't end up being 0. — Michael Burr
– Michael Burr, Commented Dec 9, 2013 at 4:03

Sergey Kalinichenko · Accepted Answer · 2013-12-09 04:15:43Z

2

Why are we adding 'a'+1 to the string?

We are not adding, we are subtracting. Moreover, we aren't doing it to the string, we do it to one character at a time.

Here is what it does, according to the authors's intentions: given a letter from a to z, the expression produces the sequence number of that letter: 'a' produces 1, 'b' produces 2, 'c' produces 3, and so on.

Unfortunately, this implementation is broken: when the letter is in upper case, isalpha returns true, but the result of the expression does not give you the letter number. In fact, if your computer uses encoding that is consistent with ASCII codes, the result would be a negative number.

why are we doing the following: hash = ((hash << 3) + n) % SIZE

This multiplies the prior value of hash by eight (shift by three is the same as multiplying by eight), adds the number of the letter, and then limits the value by obtaining the remainder of division by SIZE.

Since the actual value of the hash code is of very little interest, as long as it is sensitive to small changes in the word, you could use this function instead:

int hash (const char* word)
{
    unsigned int hash = 0;
    for (int i = 0 ; word[i] != '\0' ; i++)
    {
        hash = 31*hash + word[i];
    }
    return hash % SIZE;
}

This algorithm (without the SIZE limit) is used for calculating hash codes of Strings in Java. It is very simple and very efficient.

edited Dec 9, 2013 at 4:15

answered Dec 9, 2013 at 4:05

Sergey Kalinichenko

729k85 gold badges1.2k silver badges1.6k bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Jim Balter Over a year ago

"In fact, the result would be a negative number" -- it might or might not ... C doesn't mandate any particular encoding, so the uppercase letters might have larger values than the lowercase letters.

Sergey Kalinichenko Over a year ago

@JimBalter You're right, I didn't think about EBSDIC, where uppercase letters have higher codes than lowercase ones. Thanks!

chqrlie Over a year ago

i should have type size_t and it might be better to cast word[i] as unsigned char.

Jason Heo · Accepted Answer · 2013-12-09 04:05:19Z

1

Why are we adding 'a'+1 to the string?

if we don't add "+1", hash("a") = hash("aa") = has("aaa") ... check below code
```
char alpha = 'a';
printf("%d\n", alpha - 'a' + 1); // <= produces '1'
```
why are we doing the following: "hash = ((hash << 3) + n) % SIZE"?
```
hash = ((hash * 8) + n ) % SIZE
```

answered Dec 9, 2013 at 4:05

Jason Heo

10.3k2 gold badges39 silver badges72 bronze badges

Comments

Jim Balter · Accepted Answer · 2013-12-09 04:28:52Z

1

Why are we adding 'a'+1 to the string?

We aren't ... - means subtract, not add, and word[i] is a character of the string, not the string. So we're subtracting 'a' and adding 1 to each character of the string.

If word[i] is a lower case letter, then word[i] - 'a' + 1 calculates the number of that letter: 'a' -> 1, ... 'z' -> 26. What if it isn't a lower case letter? Well, non-alphabetic characters (not just comma, contrary to the comment) are mapped to 27, but upper case letters, if present, result in undefined behavior.

"hash = ((hash << 3) + n) % SIZE"?

This multiplies the previous hash value by 8, then adds the value 1 ... 27 for the current character, and guarantees that the result doesn't exceed SIZE, which presumably is the number of hash buckets. If the string contains more characters than the word size / 3, the initial characters will be shifted out. If SIZE is a power of 2 and the string has more than SIZE/3 characters, then all of those additional characters will be shifted out.

That's how it works, but it's not a very good hash function. Aside from the code having an erroneous comment and not handling upper case letters, it also doesn't handle long strings well, because the initial characters will get shifted out, as mentioned. Also the shift and add operation combines adjacent characters in a non-random way, so it will produce more hash bucket collisions than the optimum. This hash function is fast, but there are better fast hash functions. See https://en.wikipedia.org/wiki/Hash_function for more information.

edited Dec 9, 2013 at 4:28

answered Dec 9, 2013 at 4:04

Jim Balter

16.5k3 gold badges46 silver badges70 bronze badges

4 Comments

Tim Pierce Over a year ago

Well, comma is handled but nothing else is, so this has function assumes that the string it is hashing consists entirely of lower case letters and commas; if it contains anything else, the results are undefined. Eh? There are not any undefined cases that I see here. If the char is alphabetic, it gets mapped into the integers 1-26. If it is any other char, it gets hashed to 27. The comment says "comma case" but there is nothing in the code that treats commas specially.

Jim Balter Over a year ago

@qwrrty You're right about the mismatch between the comma and the code, but you're wrong about there being no undefined cases ... uppercase letters, for which isalpha is true, do not get mapped to 1-26; what they do get mapped to is undefined, since C doesn't mandate any particular encoding.

Tim Pierce Over a year ago

@JimBalter Thanks for the correction on the uppercase letters, careless of me to miss that. KishB87: a good hash function will map its input strings fairly evenly to numbers across the hash space. Defining a function that will do this well is not simple, but it will at least gracefully handle all possible characters in its input, which this function does not.

chqrlie Over a year ago

There is indeed potential undefined behavior, but not where you describe it. If the char type is signed, isalpha(word[i]) has undefined behavior for negative char values. To avoid this problem, the argument to isalpha must be cast as unsigned char: isalpha((unsigned char)word[i]). There is more undefined behavior here: hash = ((hash << 3) + n) % SIZE: left shifting negative values is undefined behavior. You can have a negative value for hash if the first character is an upper-case letter. Change the type of hash and c to unsigned int to avoid this.

chqrlie · Accepted Answer · 2017-02-05 21:21:23Z

The subtraction is an attempt to convert the lowercase letters to numbers from 1 to 26. The comma is converted to 27 but the upper-case letters are converted to negative values (for the ASCII character set), which has bad side effects.

Indeed there is potential undefined behavior:

If the char type is signed, isalpha(word[i]) has undefined behavior for negative char values. To avoid this problem, the argument to isalpha must be cast as unsigned char: isalpha((unsigned char)word[i]).
hash = ((hash << 3) + n) % SIZE has potential undefined behavior too: left shifting negative values is undefined behavior. You can have a negative value for hash if the first character is an upper-case letter. Change the type of hash and c to unsigned int to avoid this.

The expression hash = ((hash << 3) + n) % SIZE is used to combine the bits of all characters into a value between 0 and SIZE-1. Note however that if SIZE is not an unsigned value, the expression might produce a negative value between -SIZE+1 and -1, which would probably have bad side effects.

Transcoding the character values does not really help producing a good hash function.

Here is an safer version:

#include <limits.h>

unsigned int hash(const char *word) {
    unsigned int hash = 0, c;

    for (size_t i = 0; word[i] != '\0'; i++) {
        c = (unsigned char)word[i];
        hash = (hash << 3) + (hash >> (sizeof(hash) * CHAR_BIT - 3)) + c;
    }
    return hash % SIZE;
}

Collectives™ on Stack Overflow

Hash function for strings in C

4 Answers 4

3 Comments

Comments

4 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

3 Comments

Comments

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related