
There is an array of binary values (bits), and the number of elements in this array is around 10^20.

The number of "ones" in the array is around 10^10, and these numbers are randomly distributed.

Once the data is generated and saved, it won't be edited: it will remain read-only for its whole life cycle.

Once this data is saved, requests will be received. Each request contains an index into the array, and the response should be the value at that index. The indexes of these requests are not in any particular order (they may be random).

The question is: how can this information be encoded to save space while still giving good performance when serving requests?

My thoughts so far are:

  1. Have an array of indexes, one for each of the "ones". So, I would have an array of 10^10 elements, containing indexes in the range 0 to 10^20. Maybe not the best compression method, but it is easy to decode.

  2. The optimal solution in terms of compression: enumerate each of the possible combinations (selecting 10^10 positions out of the 10^20 available positions); then the data is just the "id" of that enumeration... but this could be a problem to decode, I think. (A rough size estimate for this approach is sketched below.)
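For a rough sense of how small option 2 could get in principle, here is a back-of-the-envelope calculation (a Python sketch; it uses the approximation log2 C(N, K) ≈ K · (log2(N/K) + log2 e), which assumes K is much smaller than N):

    import math

    N = 10**20   # array length (positions)
    K = 10**10   # number of ones

    # log2(C(N, K)) ~= K * (log2(N/K) + log2(e))  when K << N
    bits = K * (math.log2(N / K) + math.log2(math.e))
    print(f"~{bits / 8 / 1e9:.0f} GB")   # prints roughly "~43 GB"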

  • I'm almost tempted to close as a duplicate of: stackoverflow.com/q/9753806/179910. The exact numbers involved are different, but the general idea is nearly identical (and I think with minor adjustments, my solution to that problem works about equally well here). Commented Dec 17, 2014 at 2:41

1 Answer


Look up "sparse array". If access speed is important, a good solution is a hash table of indices. You should allocate about 2x the space, requiring a 180 GB table. The access time would be O(1).

You could instead keep just a sorted 90 GB table (10^10 entries at 9 bytes each) and do a binary search for an index. The access time would be O(log n), if you're happy with that speed.
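A sketch of the sorted-table variant, which is essentially option 1 from the question. At full scale the table would be a file of fixed nine-byte records searched with seeks; here it is an in-memory list:

    from bisect import bisect_left

    def lookup(sorted_ones, idx):
        # sorted_ones: sorted sequence of the ~10**10 indices that hold a one.
        pos = bisect_left(sorted_ones, idx)
        return 1 if pos < len(sorted_ones) and sorted_ones[pos] == idx else 0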

You can pack the indices more tightly, at 67 bits each, to bring the single-table approach down to less than 84 GB.
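A sketch of reading from such a packed table, assuming each index is stored in exactly 67 bits, big-endian, back to back; buf stands for the packed table (or a slice of it read from disk):

    BITS = 67                                     # ceil(log2(10**20)) bits per packed index

    def read_packed(buf, k):
        # Return the k-th 67-bit index from a densely packed big-endian bitstream.
        start = k * BITS
        first_byte = start // 8
        last_byte = (start + BITS - 1) // 8
        chunk = int.from_bytes(buf[first_byte:last_byte + 1], "big")
        trailing = (last_byte + 1) * 8 - (start + BITS)   # bits past the end of the field
        return (chunk >> trailing) & ((1 << BITS) - 1)

The binary search from above would then call read_packed(buf, mid) instead of indexing a list.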

You can break it up into multiple tables. For example, with eight tables, one for each possible value of the high three bits of the index, only 8 bytes need to be stored per entry, so the tables would take 80 GB.

You can break it up further. For example, with 2048 tables, one for each value of the high 11 bits of the index, 7 bytes per entry suffice and the total comes to 70 GB, plus some very small amount for the table of pointers to the sub-tables.

Even further, with 524,288 tables you can get down to six bytes per entry for 60 GB, plus the table-of-tables overhead, which would still be small in comparison: just megabytes.
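The pattern behind these numbers: the high h bits of an index select one of 2^h sub-tables, and only the remaining 67 - h bits, rounded up to whole bytes, are stored per entry. A quick sketch of the arithmetic:

    import math

    K = 10**10                    # number of ones
    FULL_BITS = 67                # bits needed for a raw index into 10**20 positions

    def split_tables_gb(high_bits):
        # Bytes stored per entry once the high bits are implied by the sub-table id.
        bytes_per_entry = math.ceil((FULL_BITS - high_bits) / 8)
        return K * bytes_per_entry / 1e9

    for h in (3, 11, 19, 27):
        print(h, split_tables_gb(h), "GB")   # 80.0, 70.0, 60.0, 50.0 GB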

Multiplying the number of sub-tables by 256 once more should still be a win. With about 134 million sub-tables, you could get it down to five bytes per entry, or 50 GB, plus less than a GB for the table of tables: less than 51 GB in total. You could then, for example, keep the table of tables in memory and load a sub-table into memory for each binary search, keeping a cache of sub-tables and throwing out old ones when you run out of space. Each sub-table would hold, on average, only about 75 entries, so the binary search takes around seven steps after the one step needed to find the sub-table. Most of the time will be spent getting sub-tables into memory, assuming you don't have 64 GB of RAM. Then again, maybe you do.
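A sketch of that final layout, assuming 2^27 sub-tables with sorted five-byte entries. read_subtable_bytes is a hypothetical helper that uses the in-memory table of tables to fetch one sub-table's raw bytes from disk, and an LRU cache stands in for "throwing out old ones":

    from bisect import bisect_left
    from functools import lru_cache

    HIGH_BITS = 27                      # 2**27 (~134 million) sub-tables
    LOW_BITS = 67 - HIGH_BITS           # 40 bits -> 5 bytes stored per entry

    @lru_cache(maxsize=4096)            # keep recently used sub-tables in memory
    def load_subtable(table_id):
        # read_subtable_bytes is a hypothetical helper: it looks up this sub-table's
        # offset/length in the in-memory table of tables and reads those bytes from disk.
        raw = read_subtable_bytes(table_id)
        return [int.from_bytes(raw[i:i + 5], "big") for i in range(0, len(raw), 5)]

    def lookup(idx):
        table_id = idx >> LOW_BITS                   # one step to find the sub-table
        low = idx & ((1 << LOW_BITS) - 1)
        entries = load_subtable(table_id)            # ~75 sorted entries on average
        pos = bisect_left(entries, low)              # ~7-step binary search
        return 1 if pos < len(entries) and entries[pos] == low else 0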


Comments

This isn't just a sparse array; it's a really big sparse array. And you didn't really answer the question, so you should have just left a comment.
A hash table of indices is an answer, and is exactly how I would approach this problem. Depending on space available, I would have 2 to 4 times 10^10 table entries, each of which contains a five-byte index of a one. Done.
Sorry, I think I was thrown off when you essentially started the answer with "Google it". It's going to take more than 5 bytes per index: 10^20 requires 9 bytes. If you reserve 2x the number of indices for the hash table, it will be 180,000,000,000 bytes. Is that within the reach of today's servers?
Right, nine bytes. Thanks. I knew it was a little more than 64 bits, and somehow that got incorrectly converted to 4+1 bytes in my head instead of 8+1 bytes.
If those are the requirements, then I see little choice. It is not possible, even in principle, to compress to less than about 40 GB; that is roughly the space needed to enumerate all of the possible combinations of 10^10 unique indices out of 10^20 locations. 180 GB is not much worse, and accessing an element with a hash table would be much, much faster than decoding the 40 GB integer.
