How many times string appears in another string

Question

I have a large static binary (10GB) that doesn't change.

I want to take small strings as input (15 bytes or fewer each) and determine which of them is the least frequent in the binary.

I understand that without actually searching the whole binary I won't be able to determine this exactly, so I know it will be an approximation.

Building a tree/hash table isn't feasible, since it would need on the order of 256^15 entries, which is a lot.

I have about 100GB of disk space and 8GB of RAM that will be dedicated to this task, but I can't seem to find any way to accomplish it without actually going over the file.

I have as much time as I want to prepare the big binary, and after that I'll need to decide which is the least frequent string many many times.

Any ideas?

Thanks! Daniel.

(BTW: if it matters, I'm using Python)

Are you sure you really want an approximation? Depending on what kind of file this is, an incomplete sampling could be quite misleading. — Thilo
– Thilo, Commented Apr 21, 2013 at 6:41
Maybe build a hashtable with as many prefixes as you can afford storage for? You can prune the trees that don't appear anymore. I wouldn't call it "approximation", but could be "upper bounds", with assurance to detect strings that don't appear. — Thilo
– Thilo, Commented Apr 21, 2013 at 6:45
I'll have to run the algorithm about 20,000 times, each time to decide between about 15 strings (to choose the ideal one). (The big 10GB file will always stay the same.) About the hashtable and prefix: I thought about that. I'll answer this as a comment on the answer proposed below — Avenger
– Avenger, Commented Apr 21, 2013 at 7:00
This kind of question is usually solved using suffix trees or suffix arrays. Obviously you can't keep all the tree/array in memory but you could "paginate" it. — Bakuriu
– Bakuriu, Commented Apr 21, 2013 at 7:01
The question is what to do about the values - MIN, multiply all the 4-tuples, sum them up... etc — Avenger
– Avenger, Commented Apr 21, 2013 at 7:05

Thilo · Accepted Answer · 2013-04-21 07:05:34Z

1

Maybe build a hashtable with the counts for as many n-tuples as you can afford storage for? You can prune the trees that don't appear anymore. I wouldn't call it "approximation", but could be "upper bounds", with assurance to detect strings that don't appear.

So, say you can build all 4-tuples.

Then to count occurrences for "ABCDEF" you'd have the minimum of count(ABCD), count(BCDE), count(CDEF). If that is zero for any of those, the string is guaranteed to not appear. If it is one, it will appear at most once (but maybe not at all).
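A minimal sketch of the bound described above, assuming the data fits in memory for illustration (for the real 10GB file the table would have to be built in one streaming pass and stored as a flat array of 2^32 counters). The function names `build_ngram_counts` and `upper_bound` are hypothetical, not from any library.

```python
from collections import Counter

def build_ngram_counts(data: bytes, n: int = 4) -> Counter:
    """Count every n-byte window in data (one pass over the data)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def upper_bound(query: bytes, counts: Counter, n: int = 4) -> int:
    """Upper bound on the occurrences of query: the minimum count
    over all n-byte windows of the query."""
    if len(query) < n:
        raise ValueError("query must be at least n bytes long")
    # A Counter returns 0 for windows that never appear in the data.
    return min(counts[query[i:i + n]] for i in range(len(query) - n + 1))

data = b"ABCDEFxxABCDxxCDEF"
counts = build_ngram_counts(data)
bound = upper_bound(b"ABCDEF", counts)   # min(count(ABCD), count(BCDE), count(CDEF))
absent = upper_bound(b"ABCDxy", counts)  # "CDxy" never appears, so the bound is 0
```

A bound of 0 proves the string is absent; any other value only says the string appears at most that many times.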

edited Apr 21, 2013 at 7:05

answered Apr 21, 2013 at 6:49

Thilo


5 Comments

Avenger Over a year ago

I thought about that. Doing MIN (not MAX) is what I had in mind, but I was thinking I could do it more accurately. If RARE is very rare, then I want the algorithm to clearly prefer "RARE00RARE" over "RAREabcdef".

Avenger Over a year ago

It won't solve my problem with RARE00RARE and RAREabcdef.

Thilo Over a year ago

Of course, it only gives upper bounds, but if you make these prefixes long enough, they should be pretty accurate. In fact, for the "rare" cases, it should tend to be more accurate. In your example, if "RARE" only appears once, you can get "RARE00RARE" rejected correctly.

Avenger Over a year ago

Yes, but RARE00RARE should not be rejected, it should be the "chosen string", No?

Thilo Over a year ago

What do you mean by chosen string? Not fixed-length tuples in the index, but varying length depending on rare-ness? If you do that, I am not sure you want the rare ones, though, you'd want the ones that you are going to search for most frequently (which you don't know in advance). If you include "rare" ones instead of "frequent" ones, you get really loose bounds for those (the "rare" ones already have low bounds).

mcdowella · Accepted Answer · 2013-04-21 11:38:10Z

Because you have a large static string that does not change, you can separate the one-time work of preprocessing this string, which never has to be repeated, from the work of answering queries. It might be convenient to do the one-time work on a more powerful machine.

If you can find a machine with an order of magnitude or so more internal storage, you could build a suffix array: an array of offsets into the stream, sorted by the suffix that starts at each offset. This could be kept in external storage for queries, and you could binary search it to find the first and last positions in sorted order where your query string appears. The distance between the two gives you the number of occurrences. A binary search needs about 34 binary chops to cover 16 Gbytes (taking 16 Gbytes as 2^34 bytes), so each query should cost about 68 disk seeks.

It may not be reasonable to expect you to find that amount of internal storage, but I just bought a 1TB USB hard drive for about 50 pounds, so I think you could increase external storage for the one-time work. There are algorithms for suffix array construction in external memory, but because your query strings are limited to 15 bytes you don't need anything that complicated. Just create 200GB of data by writing out the 15-byte string found at every offset followed by a 5-byte offset number, then sort these 20-byte records with an external sort. Keeping only the 5-byte offsets from the sorted records gives you 50 Gbytes of indexes into the string, in sorted order, to put into external storage and answer queries with.
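The query side of this scheme can be sketched as follows. This is an in-memory toy under stated assumptions: `build_index` stands in for the external sort, the offsets live in a Python list rather than a 50GB file, and the names `KEY_LEN`, `build_index`, `_search` and `count_occurrences` are hypothetical. On disk, each loop iteration in `_search` would correspond to reading one index entry and the bytes it points at.

```python
KEY_LEN = 15  # maximum query length stated in the question

def build_index(data: bytes) -> list:
    """Offsets sorted by the (up to) KEY_LEN bytes starting at each offset.
    Stand-in for the external sort of 20-byte records."""
    return sorted(range(len(data)), key=lambda off: data[off:off + KEY_LEN])

def _search(data: bytes, index: list, query: bytes, past_equal: bool) -> int:
    """Binary search for the first index position whose window is
    >= query (past_equal=False) or > query (past_equal=True)."""
    lo, hi = 0, len(index)
    m = len(query)
    while lo < hi:
        mid = (lo + hi) // 2
        window = data[index[mid]:index[mid] + m]
        if window < query or (past_equal and window == query):
            lo = mid + 1
        else:
            hi = mid
    return lo

def count_occurrences(data: bytes, index: list, query: bytes) -> int:
    """Exact number of (possibly overlapping) occurrences of query."""
    if not 0 < len(query) <= KEY_LEN:
        raise ValueError("query must be 1..KEY_LEN bytes long")
    first = _search(data, index, query, past_equal=False)
    last = _search(data, index, query, past_equal=True)
    return last - first

data = b"abracadabra abracadabra"
index = build_index(data)
candidates = [b"abra", b"cad", b"a"]
rarest = min(candidates, key=lambda q: count_occurrences(data, index, q))
```

Sorting by the first 15 bytes is enough because truncating every key to the query length keeps the order non-decreasing, so all matching offsets form one contiguous run.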

mcdowella · Accepted Answer · 2013-04-21 15:07:33Z

0

If you know all of the queries in advance, or are prepared to batch them up, another approach would be to build an Aho-Corasick tree (http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) from them. This takes time linear in the total size of the queries. Then you can stream the 10GB of data past the tree in time proportional to the sum of the size of that data and the number of times any string finds a match.
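A compact sketch of that batch approach, assuming non-empty `bytes` patterns and data delivered in chunks (for example, successive reads from the file). The names `build_automaton` and `count_matches` are hypothetical; this is an illustrative pure-Python version, not a tuned implementation.

```python
from collections import deque

def build_automaton(patterns):
    """Build the trie (goto), failure links and output lists."""
    goto, fail, out = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):
        state = 0
        for b in pat:
            if b not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][b] = len(goto) - 1
            state = goto[state][b]
        out[state].append(idx)
    # Breadth-first pass: depth-1 states keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for b, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and b not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(b, 0)
            # Inherit patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def count_matches(chunks, patterns):
    """Count occurrences of every pattern in one pass over the chunks."""
    goto, fail, out = build_automaton(patterns)
    counts = [0] * len(patterns)
    state = 0  # carried across chunks, so boundary-spanning matches count
    for chunk in chunks:
        for b in chunk:
            while state and b not in goto[state]:
                state = fail[state]
            state = goto[state].get(b, 0)
            for idx in out[state]:
                counts[idx] += 1
    return counts

patterns = [b"abra", b"cad", b"a", b"zzz"]
chunks = [b"abracad", b"abra abra", b"cadabra"]
counts = count_matches(chunks, patterns)
```

With roughly 20,000 rounds of 15 candidates each, all candidates could be loaded into one automaton and counted exactly in a single scan of the file, provided they are all known before the scan starts.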

answered Apr 21, 2013 at 15:07

mcdowella


Comments

Nuclearman · Accepted Answer · 2013-04-21 15:24:12Z

Since you are looking for which string is least frequent, and are willing to accept an approximate solution, you could use a series of Bloom filters instead of a hash table. If you use sufficiently large ones, you shouldn't need to worry about the query size, as you can probably keep the false positive rate low.

The idea would be to go through all of the possible query sizes and extract every sub-string of each size from the data. For example, if the queries will be between 3 and 100 bytes long, this would cost (N * (sum of (i) from i = 3 to i = 100)). Then, one by one, add each sub-string to the first Bloom filter that doesn't already contain it, creating a new Bloom filter with the same hash functions if needed. To answer a query, go through each filter and check whether the query exists within it, adding 1 to a count for every filter that contains it.
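The scheme above can be sketched like this. It is a toy under stated assumptions: the class names `BloomFilter` and `FilterSeries`, the filter size, the number of hashes and the use of `hashlib.blake2b` with double hashing are all illustrative choices, not part of the answer. Each new occurrence of a sub-string lands in the first filter that does not already report it, so the number of filters reporting a query is never below its true count; false positives can only push the estimate up.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive all positions from one 128-bit digest.
        digest = hashlib.blake2b(item, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))

class FilterSeries:
    """The k-th occurrence of a sub-string goes into the k-th filter."""
    def __init__(self, **filter_args):
        self.filter_args = filter_args
        self.filters = []

    def add(self, item: bytes) -> None:
        for f in self.filters:
            if item not in f:
                f.add(item)
                return
        new_filter = BloomFilter(**self.filter_args)  # same hash scheme
        new_filter.add(item)
        self.filters.append(new_filter)

    def estimate(self, item: bytes) -> int:
        return sum(item in f for f in self.filters)

def index_substrings(data: bytes, min_len: int, max_len: int) -> FilterSeries:
    series = FilterSeries()
    for n in range(min_len, max_len + 1):
        for i in range(len(data) - n + 1):
            series.add(data[i:i + n])
    return series

data = b"abracadabra abracadabra"
series = index_substrings(data, 3, 5)
estimate = series.estimate(b"abra")
```

In this layout the number of filters grows with the count of the most frequent sub-string, which is why the answer goes on to discuss limiting how many filters are kept.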

You'll need to balance the false positive rate against the number of filters. If the false positive rate gets too high on one of the filters, it isn't useful; likewise it's bad if you have trillions of Bloom filters (quite possible if you use one filter per sub-string). There are a couple of ways these issues can be dealt with.

To reduce the number of filters:
1. Randomly delete filters until there are only so many left. This will likely increase the false negative rate, which probably means it's better to simply delete the filters with the highest expected false positive rates.
2. Randomly merge filters until there are only so many left. Ideally avoiding merging a filter too often as it increases the false positive rate. Practically speaking, you probably have too many to do this without making use of the scalable version (see below), as it'll probably be hard enough to manage the false positive rate.
3. It also may not be a bad idea to avoid a greedy approach when adding to a Bloom filter. Be rather selective about which filter something is added to.

You might end up having to implement scalable bloom filters to keep things manageable, which sounds similar to what I'm suggesting anyway, so should work well.
