4

I am looking for a hash function table for a set of short length strings (the length of each string is less than 50) with a special feature that whenever we search for a string in this table if the string is inside the table it returns the associated object of that string or specific and unique number, and if the string is not inside the table it gives the ID of very similar string to the input. in order to define similarity between two string, we can define different function, but let assume we define it as the minimum number of operation which needs to convert one string to another. three note:

  1. the length of each inquiry string and saved string are always similar and fix.
  2. the alphabet of string are limited to only 5 different characters.
  3. off course memory and speed are both important for me.

I am not looking for the final solution, but any suggestion or introducing some papers with similar approach in analogous condition are welcome.

12
  • The search term you are looking for is "nearest neighbor under edit distance". Google finds a few papers: I don't know enough to evaluate them. Commented Jan 11, 2015 at 16:06
  • @KarolyHorvath: I suspect it's the length of the string that's less than 50, not the number of strings. Commented Jan 11, 2015 at 16:07
  • pick a generic hash for a string and sort elements in your array. Searching boils down to a binary search. Commented Jan 11, 2015 at 16:08
  • thanks,I edited the question, i mean the length of each string is less than 50 but it is fix in the program. Commented Jan 11, 2015 at 16:11
  • 1
    You are looking for a locality sensitive hash; google is your friend. A LSH is used in approximate nearest neighbor search, which is analogous to what you seem to need. Commented Jan 12, 2015 at 2:56

1 Answer 1

1

If the set of strings is finite and known, conside using GPERF to generate a perfect hash function.

Follow that with a hash of the SOUNDEX(string) which uses a linked list for collision resolution.

Use the first set for exact lookup. Use the second set for near matches.

EDIT: since your alphabet is only 5 characters, you have an opportunity to create a number of hash tables to reduce the size of your N.

Preprocess your key, identifying each unique letter. It might be AAAAA which is a key with A only. AABBBB which is AB or ABCDEF. There are about 5 factorial of those. 120 roots into hash trees, with individual hash functions if needed.

I am a bit concerned because it is unclear where the strings are actually stored. Odds are that your hash will have collisions. How do you intend to resolve that? Hash to a list of seek addresses, which will let your verify the correct hash by reading the actual string off of disk?

Sign up to request clarification or add additional context in comments.

1 Comment

the set of characters are known, but there are any possibility of any combination of these characters for a fixed length.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.