Simple Hashmap in C

Question

I've been working on a simple hashmap in C. Here is hashmap.h:

#ifndef __HASHMAP_H__
#define __HASHMAP_H__

#include <stdlib.h>
#include <stdbool.h>

typedef size_t (*HashFunction)(void*);
typedef bool (*CompareFunction)(void*, void*);

struct Pair {
    size_t hash_id;
    void* key;
    void* value;
    struct Pair* next;
};

struct HashMap {
    struct Pair** buckets;
    size_t num_buckets;
    HashFunction hfunc;
    CompareFunction cfunc;
};

struct HashMap* new_hashmap(HashFunction, CompareFunction);
struct HashMap* new_hashmap_c(HashFunction, CompareFunction, size_t);

void free_hashmap(struct HashMap*);

void insert_hashmap(struct HashMap*, void*, void*);

void* get_hashmap(struct HashMap*, void*);
void remove_hashmap(struct HashMap*, void*);

#endif // __HASHMAP_H__

And here is hashmap.c:

#include <assert.h>
#include "hashmap.h"

#define DEFAULT_BUCKETS 750000

struct HashMap* new_hashmap(HashFunction hfunc, CompareFunction cfunc) {
    return new_hashmap_c(hfunc, cfunc, DEFAULT_BUCKETS);
}

struct HashMap* new_hashmap_c(HashFunction hfunc, CompareFunction cfunc, size_t buckets) {
    struct HashMap* hmap;
    hmap = malloc(sizeof(*hmap));
    assert(hmap);

    hmap->buckets = malloc(buckets * sizeof(*hmap->buckets));
    hmap->num_buckets = buckets;
    hmap->hfunc = hfunc;
    hmap->cfunc = cfunc;

    assert(hmap->buckets);

    for (size_t i = 0; i < buckets; ++i) {
        hmap->buckets[i] = NULL;
    }

    return hmap;
}

void free_hashmap(struct HashMap* hmap) {
    free(hmap->buckets);
    free(hmap);
    hmap = NULL;
}

void insert_hashmap(struct HashMap* hmap, void* key, void* value) {
    size_t hashed_key = hmap->hfunc(key);
    struct Pair* prev = NULL;
    struct Pair* entry = hmap->buckets[hashed_key];

    while (entry != NULL) {
        if (hmap->cfunc(entry->key, key)) {
            prev = entry;
            break;
        }
        entry = entry->next;
    }

    if (entry == NULL) {
        entry = malloc(sizeof(struct Pair));
        entry->hash_id = hashed_key;
        entry->key = key;
        entry->value = value;
        entry->next = NULL;

        if (prev == NULL) {
            hmap->buckets[hashed_key] = entry;
        } else {
            prev->next = entry;
        }
    } else {
        entry->value = value;
    }
}

void* get_hashmap(struct HashMap* hmap, void* key) {
    size_t hashed_key = hmap->hfunc(key);
    struct Pair* entry = hmap->buckets[hashed_key];

    while (entry != NULL) {
        if (hmap->cfunc(entry->key, key)) return entry->value;
        entry = entry->next;
    }

    return NULL;
}

void remove_hashmap(struct HashMap* hmap, void* key) {
    size_t hashed_key = hmap->hfunc(key);
    struct Pair* prev = NULL;
    struct Pair* entry = hmap->buckets[hashed_key];

    while (entry != NULL) {
        if (hmap->cfunc(entry->key, key)) {
            prev = entry;
            break;
        }
        entry = entry->next;
    }

    if (entry == NULL) return;
    if (prev == NULL) {
        hmap->buckets[hashed_key] = entry->next;
    } else {
        prev->next = entry->next;
    }
    free(entry);
    hmap->buckets[hashed_key] = NULL;
}

And here is my test code:

#include <stdio.h>
#include <string.h>
#include "hashmap.h"

size_t hash(void* key) {
    size_t hash = 0;
    for (size_t i = 0; i < strlen(key); i++) {
        hash = 31 * hash + *((char*) (key + i));
    }

    return hash;
}

bool compare(void* key1, void* key2) {
    return *((int*) key1) == *((int*)key2);
}

int main() {
    struct HashMap* my_hmap = new_hashmap(hash, compare);
    int k = 10;
    int v = 101;
    int v2 = 102;

    insert_hashmap(my_hmap, &k, &v);
    printf("initial value: %d\n", *((int*)get_hashmap(my_hmap, &k)));

    insert_hashmap(my_hmap, &k, &v2);
    printf("value after changing: %d\n", *((int*)get_hashmap(my_hmap, &k)));

    remove_hashmap(my_hmap, &k);
    printf("pointer to deleted value: %p\n", get_hashmap(my_hmap, &k));

    free_hashmap(my_hmap);
    printf("done!");
}

Any tips on performance, coding style, etc. would be nice. Thanks!

That's an awful lot of default buckets. There are numerous bugs, too. Try your code with a hash function that returns a constant (i.e., every insert results in a hash collision). Then try another where that constant is, say, 1000000 (some number greater than DEFAULT_BUCKETS). In your test code, why is a map that stores integers using a string based hash computation? — 1201ProgramAlarm
– 1201ProgramAlarm, Commented Dec 15, 2020 at 1:12
@1201ProgramAlarm -- I seem to have gotten the incorrect hash function. Also, I did expect there to be bugs in the code; It's the first time I have ever written a hashmap and I haven't written C code in a while. — xilpex
– xilpex, Commented Dec 15, 2020 at 1:19
I would change the number of buckets to: 749993 closest prime to 750000 this will help you avoid collisions. — Loki Astari
– Loki Astari, Commented Dec 16, 2020 at 18:17

Roland Illig · Accepted Answer · 2020-12-15 01:32:20Z

Urgs, the first impression is already bad:

#ifndef __HASHMAP_H__

Every identifier that starts with a double underscore is reserved for the implementation of the C compilation environment (compiler + operating system). Since you are not working on the implementation but are rather an application developer, you must not define these identifiers. Instead, use the commonly accepted pattern of PROJECT_FILE_H, with PROJECT and FILE being placeholders.

#include <stdlib.h>
#include <stdbool.h>

It is common to sort the headers alphabetically, at least those from the C standard library. Both stdlib.h and stdbool.h come from the standard library.

typedef size_t (*HashFunction)(void*);
typedef bool (*CompareFunction)(void*, void*);

These typedefs are almost reasonable. Since these functions are not supposed to modify their arguments, you should replace void * with const void *.

struct Pair {
    size_t hash_id;
    void* key;
    void* value;
    struct Pair* next;
};

The name Pair is wrong for this struct, it should rather be called HashMapEntry. A pair typically has 2 fields, not 4, and these fields are called first and second.

struct HashMap {
    struct Pair** buckets;
    size_t num_buckets;
    HashFunction hfunc;
    CompareFunction cfunc;
};

To quickly see if the map is empty (if you need it), you could add a size_t size field.

struct HashMap* new_hashmap(HashFunction, CompareFunction);
struct HashMap* new_hashmap_c(HashFunction, CompareFunction, size_t);

All these function declarations look good. To get rid of the struct keyword, you should typedef struct HashMap HashMap;.
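
Pulling the header remarks together, a minimal sketch of how hashmap.h could look is shown here. The guard name HASHMAP_HASHMAP_H (project name "HASHMAP"), the entry name HashMapEntry, the size field and the helper hashmap_is_empty are illustrative assumptions, not part of the original code.

```c
#ifndef HASHMAP_HASHMAP_H   /* PROJECT_FILE_H pattern; project name is assumed */
#define HASHMAP_HASHMAP_H

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>         /* size_t, NULL */

/* Callbacks only read their arguments, hence const void *. */
typedef size_t (*HashFunction)(const void *key);
typedef bool (*CompareFunction)(const void *key1, const void *key2);

/* typedefs so that callers can drop the struct keyword. */
typedef struct HashMapEntry HashMapEntry;
typedef struct HashMap HashMap;

struct HashMapEntry {
    size_t hash_id;
    void *key;
    void *value;
    HashMapEntry *next;
};

struct HashMap {
    HashMapEntry **buckets;
    size_t num_buckets;
    size_t size;            /* number of stored entries */
    HashFunction hfunc;
    CompareFunction cfunc;
};

HashMap *new_hashmap(HashFunction hfunc, CompareFunction cfunc);
HashMap *new_hashmap_c(HashFunction hfunc, CompareFunction cfunc,
                       size_t num_buckets);
void free_hashmap(HashMap *hmap);

/* Hypothetical helper, not in the original API: O(1) emptiness check. */
static inline bool hashmap_is_empty(const HashMap *hmap) {
    assert(hmap != NULL);
    return hmap->size == 0;
}

#endif /* HASHMAP_HASHMAP_H */
```

For the size field to be useful, insert_hashmap and remove_hashmap would have to increment and decrement it whenever an entry is actually added or removed.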

Next, the implementation.

#include <assert.h>
#include "hashmap.h"

Looks great. Putting assertions all over the code is good style. That is much better than comments of the kind "I expect that this variable is never NULL".

#define DEFAULT_BUCKETS 750000

Whoa, that's a high number. Other HashMap implementations typically use a default bucket count of 16.

struct HashMap* new_hashmap_c(HashFunction hfunc, CompareFunction cfunc, size_t buckets) {
    struct HashMap* hmap;
    hmap = malloc(sizeof(*hmap));
    assert(hmap);

This assertion is dangerous since it goes away when you compile the code with the preprocessor flag -DNDEBUG. You must either ensure that your code is always compiled with assertions enabled, or handle the failure explicitly, as in if (hmap == NULL) enomem(), where enomem stands for an error handler that you write yourself.

    hmap->buckets = malloc(buckets * sizeof(*hmap->buckets));

This multiplication might overflow if the number of buckets is really high. This will probably not happen though. If your HashMap has to cope with untrusted input, however, you risk a buffer overflow and a security vulnerability that could allow an attacker to run arbitrary code on your computer.

The rest of new_hashmap_c looks fine, except for the assert.
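
As a rough sketch of both points (an allocation check that survives -DNDEBUG, and an overflow check before the multiplication), the bucket allocation could be factored out as follows. The names enomem, mul_overflows and alloc_buckets are hypothetical, and exiting the program is only one possible way to react to an allocation failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>   /* SIZE_MAX */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical error handler: report the failure and terminate. */
void enomem(void) {
    fputs("hashmap: out of memory\n", stderr);
    exit(EXIT_FAILURE);
}

/* Returns true when count * elem_size would not fit into a size_t. */
bool mul_overflows(size_t count, size_t elem_size) {
    return elem_size != 0 && count > SIZE_MAX / elem_size;
}

/* Allocates `count` bucket slots, all set to NULL; never returns NULL. */
void **alloc_buckets(size_t count) {
    assert(count > 0);   /* programmer error; checked in debug builds only */

    if (mul_overflows(count, sizeof(void *))) enomem();

    void **buckets = malloc(count * sizeof *buckets);
    if (buckets == NULL) enomem();   /* still checked with -DNDEBUG */

    for (size_t i = 0; i < count; i++) {
        buckets[i] = NULL;
    }
    return buckets;
}
```

The assert remains for a condition that can only be a bug in the calling code, while conditions that depend on the environment (available memory, input size) get a regular if.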

void free_hashmap(struct HashMap* hmap) {
    free(hmap->buckets);
    free(hmap);
    hmap = NULL;
}

There is no point in setting hmap = NULL at the end. This will only influence the local variable hmap inside the function free_hashmap. If the calling function has a variable, that variable will not be influenced at all and still point to the freed memory. That's ok because the caller is not supposed to do anything with that variable anymore.

void insert_hashmap(struct HashMap* hmap, void* key, void* value) {
    size_t hashed_key = hmap->hfunc(key);
    struct Pair* prev = NULL;
    struct Pair* entry = hmap->buckets[hashed_key];

That's bad design. Your HashFunction is supposed to return a number in the range [0, hmap->num_buckets), but the field num_buckets should not be accessed by any code outside your own implementation. Therefore insert_hashmap should ensure for itself that the return value of the hash function is in the correct range. A simple one-line change does this, applied where hashed_key is computed so that every later use of hmap->buckets[hashed_key] stays in range as well:

    size_t hashed_key = hmap->hfunc(key) % hmap->num_buckets;

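The rest of insert_hashmap has two more problems. The null check after malloc is missing, as explained above. And prev is only assigned when a matching key is found, so when a different key lands in an already occupied bucket, prev is still NULL after the loop and the new entry overwrites the bucket head, dropping (and leaking) the existing chain.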

The function get_hashmap looks fine once the bucket index is reduced with % hmap->num_buckets in the same way.

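The function remove_hashmap is broken as soon as a bucket holds more than one entry. prev is set to the matching entry itself instead of its predecessor, so prev->next = entry->next assigns entry->next to itself and unlinks nothing, and the final hmap->buckets[hashed_key] = NULL throws away every other entry in that bucket. It also needs the % hmap->num_buckets reduction.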
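
For comparison, here is a minimal sketch of the three chain operations that keeps the rest of a bucket intact when keys collide. It walks a pointer to the link (the bucket slot or a next field) instead of tracking prev. This is one possible approach, not the only one; find_link, constant_hash and compare_int are hypothetical names, the bool return value of insert_hashmap is an assumption that differs from the original void signature, and the hash_id field is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef size_t (*HashFunction)(const void *);
typedef bool (*CompareFunction)(const void *, const void *);

struct Pair {
    void *key;
    void *value;
    struct Pair *next;
};

struct HashMap {
    struct Pair **buckets;
    size_t num_buckets;
    HashFunction hfunc;
    CompareFunction cfunc;
};

/* Returns the address of the link that points at the entry for `key`,
 * or of the chain's terminating NULL link if the key is absent. */
static struct Pair **find_link(struct HashMap *hmap, const void *key) {
    assert(hmap != NULL && hmap->num_buckets > 0);
    size_t index = hmap->hfunc(key) % hmap->num_buckets;
    struct Pair **link = &hmap->buckets[index];
    while (*link != NULL && !hmap->cfunc((*link)->key, key)) {
        link = &(*link)->next;
    }
    return link;
}

/* Returns false only if memory for a new entry could not be allocated. */
bool insert_hashmap(struct HashMap *hmap, void *key, void *value) {
    struct Pair **link = find_link(hmap, key);
    if (*link == NULL) {
        struct Pair *entry = malloc(sizeof *entry);
        if (entry == NULL) return false;
        entry->key = key;
        entry->next = NULL;
        *link = entry;          /* append; existing entries stay linked */
    }
    (*link)->value = value;     /* new entry or overwrite of an old value */
    return true;
}

void *get_hashmap(struct HashMap *hmap, const void *key) {
    struct Pair *entry = *find_link(hmap, key);
    return entry != NULL ? entry->value : NULL;
}

void remove_hashmap(struct HashMap *hmap, const void *key) {
    struct Pair **link = find_link(hmap, key);
    struct Pair *entry = *link;
    if (entry == NULL) return;
    *link = entry->next;        /* unlink this entry only */
    free(entry);
}

/* Demo callbacks: a constant hash larger than any bucket count forces
 * every key into the same bucket, as suggested in the comments. */
size_t constant_hash(const void *key) { (void)key; return 1000000; }

bool compare_int(const void *a, const void *b) {
    return *(const int *)a == *(const int *)b;
}
```

With constant_hash, inserting three different keys, removing the middle one and looking up the other two exercises exactly the collision paths that the original test never reaches.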

Next and last: the test code.

size_t hash(void* key) {
    ...
        hash = 31 * hash + *((char*) (key + i));

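I wonder how you got the compiler to accept this broken code. key is a void pointer, and standard C does not allow arithmetic on void pointers; GCC and Clang accept it only as an extension that treats the pointed-to size as 1. The usual way to deal with this is to convert the void pointer into a character pointer first, as the rewritten hash function below does.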

    for (size_t i = 0; i < strlen(key); i++) {

Calling strlen in the loop condition is expensive. This is because in C, strlen needs to look at every character of a string until it finds the '\0' that terminates the string, and the condition is evaluated on every iteration. If you have a string with 1_000_000 characters, this can take about 1_000_000_000_000 memory accesses (unless the compiler manages to hoist the call out of the loop), which is really slow. The usual pattern in C is to start with a pointer to the beginning of the string and to advance this pointer until it points to the '\0', like this:

size_t hash(const void* key) {
    size_t hash = 0;
    for (const unsigned char *p = key; *p != '\0'; p++) {
        hash = 31 * hash + *p;
    }
    return hash;
}

bool compare(void* key1, void* key2) {
    return *((int*) key1) == *((int*)key2);
}

Ah, so the HashMap maps string keys to int values. These facts should be represented in the function names. Instead of hash and compare, these should rather be called hash_str and compare_int.

int main() {

Before C23, the empty parentheses () do not give the function a prototype, so the compiler does not check how many arguments a caller passes. This was a historical accident. Since C90, you should write (void) here to state that the function takes no arguments at all.

    struct HashMap* my_hmap = new_hashmap(hash, compare);
    int k = 10;
    int v = 101;
    int v2 = 102;

    insert_hashmap(my_hmap, &k, &v);

Nope. As I said above, the code says that the keys to the map are strings, yet you pass an int to it. This invokes undefined behavior, which you must avoid.
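
If the intent is to map int keys to int values, both callbacks have to interpret the key as a pointer to int. A minimal sketch, assuming the const void * signatures recommended above; the name hash_int and the choice of using the key value itself as the hash are illustrative assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hash for keys that point to an int: uses the value itself, converted to
 * unsigned first so that negative keys wrap in a well-defined way. */
size_t hash_int(const void *key) {
    assert(key != NULL);
    return (size_t)(unsigned int)*(const int *)key;
}

/* Two keys are equal when the ints they point to are equal. */
bool compare_int(const void *key1, const void *key2) {
    assert(key1 != NULL && key2 != NULL);
    return *(const int *)key1 == *(const int *)key2;
}
```

Such an identity hash only works together with the % hmap->num_buckets reduction inside the map, and it spreads keys evenly only if the keys themselves are spread evenly.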

    printf("done!");

There is a \n missing after the "done!". Without this newline, there is no guarantee that the "done!" is printed at all. That's for historic reasons, but it's a good rule. Every program that outputs text should do so in whole lines.

Summary: you got many things right and many things wrong, but that's to be expected when you submit your code for review.

For comparison, here is the header and implementation that's very similar to yours. You can practice reading other people's code using that, and if you find any differences, have a look at the GitHub history of these files. They are more than 27 years old and have evolved a lot in all this time.

Thanks for the in-depth review! In the test code, I am mapping ints to ints, rather than strings to ints; I accidentally got a string hash function. — xilpex
– xilpex, Commented Dec 15, 2020 at 1:48
No, not quite. In the test code you are pretending to map ints to ints. In reality you just invoke undefined behavior. — Roland Illig
– Roland Illig, Commented Dec 15, 2020 at 1:51
I apologize for my ignorance, but how does it invoke undefined behavior? — xilpex
– xilpex, Commented Dec 15, 2020 at 1:54
By calling strlen on an int pointer. I don't think the C standard makes any guarantee for this. — Roland Illig
– Roland Illig, Commented Dec 15, 2020 at 5:23

chux · Accepted Answer · 2020-12-15 11:47:36Z

Use a prime number for the number of buckets.

To expand on @Roland Illig's idea about % hmap->num_buckets.

Ensure the array index is within range by modding with the number of buckets:

size_t hashed_key = hmap->hfunc(key);
// struct Pair* entry = hmap->buckets[hashed_key];
struct Pair* entry = hmap->buckets[hashed_key % hmap->num_buckets];

If hmap->hfunc() is a good hash function, then the value of hmap->num_buckets makes little difference.

Yet if hmap->hfunc() has weaknesses, performing % some_prime improves the hashing.

A bucket number of 750000 or 0xB71B0 emphasizes the last four bits of hashed_key over the other bits. A prime would nominally equally use all the bits of hashed_key.
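
The effect can be checked with a small experiment. The sketch below is an illustration, not part of the reviewed code, and buckets_used is a hypothetical helper. It feeds hashes whose last four bits are always zero (multiples of 16) into a table and counts how many distinct buckets get used. Because 750000 is divisible by 16, only every 16th bucket is reachable, whereas the bucket count 749993 suggested in the comments shares no factor with 16 and so every bucket is reachable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Counts how many of `num_buckets` buckets receive at least one of the
 * hashes 0, stride, 2 * stride, ..., (n - 1) * stride.
 * Returns 0 if the scratch array cannot be allocated. */
size_t buckets_used(size_t num_buckets, size_t stride, size_t n) {
    assert(num_buckets > 0);

    bool *used = calloc(num_buckets, sizeof *used);
    if (used == NULL) return 0;

    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        size_t index = (i * stride) % num_buckets;
        if (!used[index]) {
            used[index] = true;
            count++;
        }
    }

    free(used);
    return count;
}
```

Calling buckets_used(750000, 16, 750000) reports 46875 used buckets (750000 / 16), while buckets_used(749993, 16, 749993) reports all 749993.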
