hash function for string

Question

I am looking for a hash function table for a set of short length strings (the length of each string is less than 50) with a special feature that whenever we search for a string in this table if the string is inside the table it returns the associated object of that string or specific and unique number, and if the string is not inside the table it gives the ID of very similar string to the input. in order to define similarity between two string, we can define different function, but let assume we define it as the minimum number of operation which needs to convert one string to another. three note:

the length of each inquiry string and saved string are always similar and fix.
the alphabet of string are limited to only 5 different characters.
off course memory and speed are both important for me.

I am not looking for the final solution, but any suggestion or introducing some papers with similar approach in analogous condition are welcome.

The search term you are looking for is "nearest neighbor under edit distance". Google finds a few papers: I don't know enough to evaluate them. — Igor Tandetnik
– Igor Tandetnik, Commented Jan 11, 2015 at 16:06
@KarolyHorvath: I suspect it's the length of the string that's less than 50, not the number of strings. — Igor Tandetnik
– Igor Tandetnik, Commented Jan 11, 2015 at 16:07
pick a generic hash for a string and sort elements in your array. Searching boils down to a binary search. — Raxvan
– Raxvan, Commented Jan 11, 2015 at 16:08
thanks,I edited the question, i mean the length of each string is less than 50 but it is fix in the program. — user2373198
– user2373198, Commented Jan 11, 2015 at 16:11
You are looking for a locality sensitive hash; google is your friend. A LSH is used in approximate nearest neighbor search, which is analogous to what you seem to need. — Stefan Atev
– Stefan Atev, Commented Jan 12, 2015 at 2:56

EvilTeach · Accepted Answer · 2015-01-11 18:31:26Z

1

If the set of strings is finite and known, conside using GPERF to generate a perfect hash function.

Follow that with a hash of the SOUNDEX(string) which uses a linked list for collision resolution.

Use the first set for exact lookup. Use the second set for near matches.

EDIT: since your alphabet is only 5 characters, you have an opportunity to create a number of hash tables to reduce the size of your N.

Preprocess your key, identifying each unique letter. It might be AAAAA which is a key with A only. AABBBB which is AB or ABCDEF. There are about 5 factorial of those. 120 roots into hash trees, with individual hash functions if needed.

I am a bit concerned because it is unclear where the strings are actually stored. Odds are that your hash will have collisions. How do you intend to resolve that? Hash to a list of seek addresses, which will let your verify the correct hash by reading the actual string off of disk?

edited Jan 11, 2015 at 18:31

answered Jan 11, 2015 at 16:20

EvilTeach

29k21 gold badges89 silver badges144 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

user2373198 Over a year ago

the set of characters are known, but there are any possibility of any combination of these characters for a fixed length.

Collectives™ on Stack Overflow

hash function for string

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related