Storing and indexing binary strings in a database

Question

A binary string as defined here is fixed size "array" of bits. I call them strings since there is no order on them (sorting/indexing them as numbers has no meaning), each bit is independent of the others. Each such string is N bits long, with N in the hundreds.

I need to store these strings and given a new binary string query for the nearest neighbor using the Hamming distance as the distance metric.
There are specialized data-structures (metric-trees) for metric-based search (VP-trees, cover-trees, M-trees), but I need to use a regular database (MongoDB in my case).

Is there some indexing function that can be applied to the binary strings that can help the DB access only a subset of the records before performing the one-to-one Hamming distance match? Alternatively, how would it be possible to implement such Hamming based search on a standard DB?

"I call them strings because there is no order on them" - strings have order - lexicographical, specifically. — Nick Johnson
– Nick Johnson, Commented Jul 7, 2011 at 0:50
Of course, yet usually sequences of bits are referred to as "numbers", or integers to be exact, which do have a natural order. — Adi Shavit
– Adi Shavit, Commented Jul 7, 2011 at 7:19

tskuzzy · Accepted Answer · 2011-07-06 15:58:25Z

4

The hamming distance is a metric so it satisfies the triangle inequality. For each bitstring in your database, you could store the it's hamming distance to some pre-defined constant bitstring. Then you can use the triangle inequality to filter out bitstrings in the database.

So let's say

C <- some constant bitstring
S <- bitstring you're trying to find the best match for
B <- a bitstring in the database
distS <- hamming_dist(S,C)
distB <- hamming_dist(B,C)

So for each B, you would store it's corresponding distB.

A lower bound for hamming(B,S) would then be abs(distB-distS). And the upper bound would be distB+distS.

You can discard all B such that the lower bound is higher than the lowest upper bound.

I'm not 100% sure as to the optimal way to choose which C. I think you would want it to be a bitstring that's close to the "center" of your metric space of bitstrings.

answered Jul 6, 2011 at 15:58

tskuzzy

36.6k15 gold badges78 silver badges145 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Nick Johnson Over a year ago

This is equivalent to storing a Bk-tree with only the root node. It's going to help a bit, but it's still going to leave a lot of search space to examine. You could store the distance to multiple strings (ala a FHFQ bk-tree), but that'd require a query with multiple inequalities, which can't be evaluated efficiently on a 1-dimensional index.

Adi Shavit Over a year ago

@tskuzzy: Thanks, just what I wanted. Just to clarify the reply: Let minDistB be the smallest distB of Bs in the database, then discard all B such that: abs(distB-distS) > distS + minDistB.

Adi Shavit Over a year ago

@Nick Johnson: You're right (thanks for the BK-Tree post link). Indeed, the obvious extension would be to index against several such Cs, spread around the space and have the database index several Bs, one per C.

Nick Johnson Over a year ago

@Adi Right. There's no way to satisfy a query with multiple inequalities efficiently against a 1-dimensional index, though, so the savings wouldn't be as large as using an actual tree-structure to store the data.

Nick Johnson · Accepted Answer · 2011-07-07 00:55:34Z

2

You could use an approach similar to levenshtein automata, applied to bitstrings:

Compute the first (lexicographically smallest) bitstring that would meet your hamming-distance criteria.
Fetch the first result from the database that's greater than or equal to that value
If the value is a match, output it and fetch the next result. Otherwise, compute the next value greater than it that is a match, and goto 2.

This is equivalent to doing a merge join over your database and the list of possible matches, without having to generate every possible match. It'll reduce the search space, but it's still likely to require a significant number of queries.

answered Jul 7, 2011 at 0:55

Nick Johnson

101k17 gold badges131 silver badges198 bronze badges

Collectives™ on Stack Overflow

Storing and indexing binary strings in a database

2 Answers 2

4 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related