3

A binary string as defined here is fixed size "array" of bits. I call them strings since there is no order on them (sorting/indexing them as numbers has no meaning), each bit is independent of the others. Each such string is N bits long, with N in the hundreds.

I need to store these strings and given a new binary string query for the nearest neighbor using the Hamming distance as the distance metric.
There are specialized data-structures (metric-trees) for metric-based search (VP-trees, cover-trees, M-trees), but I need to use a regular database (MongoDB in my case).

Is there some indexing function that can be applied to the binary strings that can help the DB access only a subset of the records before performing the one-to-one Hamming distance match? Alternatively, how would it be possible to implement such Hamming based search on a standard DB?

2
  • "I call them strings because there is no order on them" - strings have order - lexicographical, specifically. Commented Jul 7, 2011 at 0:50
  • Of course, yet usually sequences of bits are referred to as "numbers", or integers to be exact, which do have a natural order. Commented Jul 7, 2011 at 7:19

2 Answers 2

4

The hamming distance is a metric so it satisfies the triangle inequality. For each bitstring in your database, you could store the it's hamming distance to some pre-defined constant bitstring. Then you can use the triangle inequality to filter out bitstrings in the database.

So let's say

C <- some constant bitstring
S <- bitstring you're trying to find the best match for
B <- a bitstring in the database
distS <- hamming_dist(S,C)
distB <- hamming_dist(B,C)

So for each B, you would store it's corresponding distB.

A lower bound for hamming(B,S) would then be abs(distB-distS). And the upper bound would be distB+distS.

You can discard all B such that the lower bound is higher than the lowest upper bound.

I'm not 100% sure as to the optimal way to choose which C. I think you would want it to be a bitstring that's close to the "center" of your metric space of bitstrings.

Sign up to request clarification or add additional context in comments.

4 Comments

This is equivalent to storing a Bk-tree with only the root node. It's going to help a bit, but it's still going to leave a lot of search space to examine. You could store the distance to multiple strings (ala a FHFQ bk-tree), but that'd require a query with multiple inequalities, which can't be evaluated efficiently on a 1-dimensional index.
@tskuzzy: Thanks, just what I wanted. Just to clarify the reply: Let minDistB be the smallest distB of Bs in the database, then discard all B such that: abs(distB-distS) > distS + minDistB.
@Nick Johnson: You're right (thanks for the BK-Tree post link). Indeed, the obvious extension would be to index against several such Cs, spread around the space and have the database index several Bs, one per C.
@Adi Right. There's no way to satisfy a query with multiple inequalities efficiently against a 1-dimensional index, though, so the savings wouldn't be as large as using an actual tree-structure to store the data.
2

You could use an approach similar to levenshtein automata, applied to bitstrings:

  1. Compute the first (lexicographically smallest) bitstring that would meet your hamming-distance criteria.
  2. Fetch the first result from the database that's greater than or equal to that value
  3. If the value is a match, output it and fetch the next result. Otherwise, compute the next value greater than it that is a match, and goto 2.

This is equivalent to doing a merge join over your database and the list of possible matches, without having to generate every possible match. It'll reduce the search space, but it's still likely to require a significant number of queries.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.