
I need to map fixed-size strings containing only alphanumeric characters (A-Z and 0-9, no lower case) to other strings. The unordered_map becomes very large (tens of millions of keys), while the mapped values come from a set of a few thousand strings. Profiling shows that most of the time is spent inserting new values into the map (operator[]), and that clearing the map also takes a very long time.

std::unordered_map<std::string, std::string> hashMap;
while (...){
    ...
    hashMap[key] = value;  // ~50% of program time is spent here
    ...
}

hashMap.clear();  // Takes a very long time, at this point hashMap.size() > 20,000,000

My suspicion is that the string allocations/deallocations are very slow, as are hashing and inserting into the map. Any suggestions to optimize this? Keep in mind that the key size is constant and the keys are drawn from an alphabet of 36 characters, and that the mapped values come from a limited set. I'm open to using containers / data types other than strings and unordered_map.

Update

Following a suggestion by Baum mit Augen, I changed my key type to unsigned long long and wrote a function to convert the base-36 key strings to integers:

unsigned long long ConvertBase36(const char* num)
{
    // Packs a fixed-size 12-character base-36 key ('0'-'9', 'A'-'Z') into a
    // 64-bit integer; log2(36^12) < 64, so the result always fits.
    // No validation: the input is assumed well-formed.
    unsigned long long retVal = 0;

    for (int i = 0; i < 12; i++)
    {
        unsigned int digit = 0;
        char currChar = num[i];
        if (currChar <= '9')
        {
            digit = currChar - '0';        // numeric digit: '0'-'9' -> 0-9
        }
        else
        {
            digit = currChar - 'A' + 10;   // letter digit: 'A'-'Z' -> 10-35
        }

        retVal *= 36;
        retVal += digit;
    }

    return retVal;
}

This gave me about a 10% improvement in whole-program runtime. I then tried the unordered_map reserve function again to see if it made any difference, and it did not. Trying map instead of unordered_map performed about 10% worse, so I reverted that change. Finally, replacing the string value with an unsigned int made things a bit faster.
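
For reference, the resulting layout looks roughly like this (a sketch; the names hashMap and valueId are illustrative, and ConvertBase36 is the function above):

#include <unordered_map>

unsigned long long ConvertBase36(const char* num);  // defined above

// Key: a 12-character base-36 string packed into 64 bits.
// Value: a small integer id standing in for one of the few thousand strings.
std::unordered_map<unsigned long long, unsigned int> hashMap;

void Insert(const char* key, unsigned int valueId)
{
    // Hashing and copying a 64-bit key is far cheaper than a std::string
    hashMap[ConvertBase36(key)] = valueId;
}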

  • How long are the keys? You could convert (substrings) to integers with strtol in base 36. What platform? Have you turned on optimization? Commented Aug 22, 2016 at 9:50
  • @EyalK. For 20 million keys, std::map should be a lot slower than std::unordered_map, not "just as bad". It is O(log N) vs. amortized O(1) for insertion of a single new item. Commented Aug 22, 2016 at 9:59
  • @EyalK. Yes, but then the map need not make millions of copies of value strings, only millions of copies of value integers. The latter is rather cheap. Commented Aug 22, 2016 at 10:11
  • Actually, I just noticed that your keys fit into 64-bit ints, as log2(36^12) < 64. So if you can come up with a fast injective function key string (or char[12]) -> uint64_t, your hash becomes the identity, and that function may be faster than std::hash<std::string> because it can use the size information. That would also make std::map faster, as integer comparisons are dirt cheap, so you could try that as well. Commented Aug 22, 2016 at 10:32
  • @EyalK. When you're building up the hash table, don't use the index operator. Instead, use insert. Even better would be to call emplace if that's available. You should get some performance gains because it reduces the number of allocation/deallocation calls occurring within the implementation (see the sketch below). Commented Aug 22, 2016 at 12:38
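
A sketch of that last suggestion (note that, unlike operator[], emplace leaves an existing entry untouched, so this is only a drop-in replacement when each key is inserted once):

std::unordered_map<std::string, std::string> hashMap;
// ...
hashMap.emplace(key, value);  // constructs the pair in place, avoiding the
                              // default-construct-then-assign of operator[]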

2 Answers


Two separate suggestions, both related to std::unordered_map::reserve.

First, since your unordered map contains tens of millions of elements, there are probably many re-allocations/rehashes going on as you insert. At the start, reserve space for the expected number of entries.
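
For example (the count here is illustrative; use whatever your input dictates):

std::unordered_map<std::string, std::string> hashMap;
hashMap.reserve(20000000);  // buckets for ~20M elements up front,
                            // so no rehashing during the insert loop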


Since

the mapped values are from a set of a few thousand strings

you should be able to store the values themselves in a secondary unordered_set that you first reserve to something large enough to ensure no iterators get invalidated on inserts - see invalidation guarantees for unordered associative containers.

Your (primary) unordered_map can then map strings to std::unordered_set::const_iterators.
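
A minimal sketch of that layout (names and the reserve count are illustrative):

#include <string>
#include <unordered_map>
#include <unordered_set>

std::unordered_set<std::string> values;  // the few thousand distinct values
std::unordered_map<std::string,
                   std::unordered_set<std::string>::const_iterator> hashMap;

void Insert(const std::string& key, const std::string& value)
{
    // insert() returns {iterator, bool}; the iterator is usable whether or
    // not the value was already present. As long as `values` never rehashes
    // (reserve it once, before any inserts), the stored iterators stay valid.
    hashMap[key] = values.insert(value).first;
}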


4 Comments

I tried using reserve but it made no noticeable difference. After reading the docs, it seems this is effectively the same as the other answer's suggestion to change the bucket count. I don't think key hashes are colliding, otherwise I would expect to see a difference. I will try changing the value type and see if that does anything.
You are probably better off with a secondary vector instead of an unordered map, and map to indices instead (sketched below). It saves the map-internal indirection, and a vector of a couple thousand strings easily fits into L3 cache (and maybe into L2 if you are lucky) on many processors. Needs measurement though, of course.
@BaummitAugen That's an interesting suggestion, thanks (tree with eyes? I hardly speak German).
"tree with eyes" That's exactly what it means. :)

For such a large number of entries, you might want to consider playing around with the number of buckets in your hash map.

For starters, you can query the implementation-defined default like this:

unordered_map<T, U> um;
cout << um.bucket_count();

Then you can play around and see which value produces the best results:

const size_t N = 100;
unordered_map<T, U> m(N);
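
For instance, to see what bucket count a given constructor argument actually produces (implementations may round the request up, e.g. to a prime or a power of two):

#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    const size_t N = 100;
    std::unordered_map<std::string, std::string> m(N);
    std::cout << m.bucket_count() << '\n';  // at least N; exact value varies
}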

1 Comment

The implementation-defined bucket count is 8. Changing it to 100 didn't make any noticeable difference.
