
Before explaining my problem, I should mention that I am not looking for a way to increase the Java heap size. I strictly need to store all of these objects in memory.

I am working on storing a huge number of DNA sequences (5-10 GB in total) and their counts (Integer) in a hash table. Each DNA sequence (of length 32 or less) consists of the characters 'A', 'C', 'G', 'T', and 'N' (undefined). As we know, when storing a large number of objects in memory, Java has poor space efficiency compared to lower-level languages like C and C++. Thus, if I store each sequence as a String (a String of length ~30 occupies roughly 100 bytes of heap), I run out of memory and see an OutOfMemoryError.

I tried representing the nucleic acids as 'A'=00, 'C'=01, 'G'=10, 'T'=11 and dropping 'N' (because, as a fifth symbol, it breaks the char-to-2-bit transform), then concatenating these 2-bit codes into a byte array. This brought some improvement, but unfortunately I still hit the error after a couple of hours. I need a proper solution, or at least a workaround, to handle this error. Thank you in advance.
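The 2-bit packing described above can be sketched as follows (a minimal illustration, not the asker's actual code; class and method names are made up here, and 'N' is simply rejected):

```java
public class TwoBitPacker {
    // Pack a DNA string (A/C/G/T only) into a byte array, 4 bases per byte.
    static byte[] pack(String seq) {
        byte[] out = new byte[(seq.length() + 3) / 4];
        for (int i = 0; i < seq.length(); i++) {
            int code;
            switch (seq.charAt(i)) {
                case 'A': code = 0; break;
                case 'C': code = 1; break;
                case 'G': code = 2; break;
                case 'T': code = 3; break;
                default: throw new IllegalArgumentException("unsupported base: " + seq.charAt(i));
            }
            // place the 2-bit code at its position within the byte
            out[i / 4] |= code << ((i % 4) * 2);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] packed = pack("ACGT");
        // 'A'=00, 'C'=01, 'G'=10, 'T'=11  ->  0b11_10_01_00 = 0xE4
        System.out.println(Integer.toHexString(packed[0] & 0xFF)); // prints "e4"
    }
}
```

This cuts the raw payload from 2 bytes per char (UTF-16) to a quarter byte per base, but each byte[] is still a separate heap object with its own header, which is where the remaining overhead comes from.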

  • Well you need to increase Java memory. How much are you giving the JVM? Commented Apr 14, 2017 at 2:25
  • Can you divide your problem into equivalent subproblems that can be solved sequentially? i.e. if I need to sum 1000 numbers, I could separate this problem into 10 subproblems, each one consisting of summing 100 numbers, and then sum the partial results Commented Apr 14, 2017 at 2:51
  • @stdunbar About 2 GB. I should continue with the default settings for JVM as I said. Most probably, if I allocate 10 GB for JVM, the program will run without error. Commented Apr 14, 2017 at 6:05
  • @FedericoPeraltaSchaffner What if the task is finding the N most frequent subsequences with their frequencies? Yes, I can split the problem into B subproblems. Then, for each subproblem, I may select the K most frequent subsequences. Finally, I merge all B*K subsequences and group the non-unique ones to find their frequencies "approximately". It is possible to select the N most frequent subsequences from this approximate list, but I need exact results. Commented Apr 14, 2017 at 6:05
  • OK, if you need exact results and your problem can't be split into subproblems that return an exact result, and if the data structure you need to use doesn't fit into available memory, then you need to start thinking about using a database. Commented Apr 14, 2017 at 13:52

3 Answers


This may be a weird idea, and it would require quite a lot of work since it is fairly complex, but here is what I would try:

You already pointed out two individual subproblems of your overall task:

  • the default HashMap implementation may be suboptimal for such large collection sizes
  • you need to store something other than strings

The map implementation

I would recommend writing a highly tailored hash map implementation for the Map<String, Long> interface. Internally you do not have to store strings. Unfortunately 5^32 > 2^64, so there is no way to pack the whole string into a single long; let's stick with two longs per key. You can convert a string to and from long[2] fairly efficiently on the fly whenever a string key is passed to your map implementation (use bit shifts etc.).

As for packing the values, here are some considerations:

For its key-value pairs, a standard hash map needs an array of N buckets, where N is the current capacity; once a bucket is found from the key's hash code, it keeps a linked list of key-value entries to resolve keys that produce identical hash codes. For your specific case you could optimize it in the following way:

  • use a long[] of size 3N where N is the capacity to store both keys and values in a continuous array
  • in this array, at locations 3 * (hashcode % N) and 3 * (hashcode % N) + 1, store the long[2] representation of the first key that mapped to this bucket (or zeros if the bucket is empty); at location 3 * (hashcode % N) + 2, store the corresponding count
  • for all those cases where a different key results in the same hash code and thus the same bucket, store the data in a standard HashMap<Long2KeyWrapper, Long>. The idea is to keep the capacity of the array mentioned above large enough (and resize it correspondingly) that by far the largest part of the data lives in that contiguous array and not in the fallback hash map. This will dramatically reduce the storage overhead of the hash map
  • do not double the capacity (N → 2N) on each resize; make smaller growth steps, e.g. 10-20%. This will cost performance while populating the map, but will keep your memory footprint under control
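A stripped-down sketch of this layout (the class name is hypothetical, resizing is omitted, and for brevity the fallback map is keyed by a string instead of the Long2KeyWrapper mentioned above):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed layout: keys are long[2], counts are long,
// all three stored contiguously in one long[] of size 3 * capacity.
public class FlatCountMap {
    private final int capacity;
    private final long[] slots;   // [k0, k1, count] per bucket
    private final boolean[] used; // marks occupied buckets
    private final Map<String, Long> fallback = new HashMap<>(); // collision overflow

    public FlatCountMap(int capacity) {
        this.capacity = capacity;
        this.slots = new long[3 * capacity];
        this.used = new boolean[capacity];
    }

    private int bucket(long k0, long k1) {
        int h = Long.hashCode(k0) * 31 + Long.hashCode(k1);
        return Math.floorMod(h, capacity);
    }

    public void increment(long k0, long k1) {
        int base = 3 * bucket(k0, k1);
        if (!used[base / 3]) {                // empty bucket: claim it
            used[base / 3] = true;
            slots[base] = k0;
            slots[base + 1] = k1;
            slots[base + 2] = 1;
        } else if (slots[base] == k0 && slots[base + 1] == k1) {
            slots[base + 2]++;                // same key: bump count in place
        } else {                              // different key, same bucket: overflow
            fallback.merge(k0 + ":" + k1, 1L, Long::sum);
        }
    }

    public long count(long k0, long k1) {
        int base = 3 * bucket(k0, k1);
        if (used[base / 3] && slots[base] == k0 && slots[base + 1] == k1) {
            return slots[base + 2];
        }
        return fallback.getOrDefault(k0 + ":" + k1, 0L);
    }
}
```

The point of the design is that the common case touches only one flat long[] with zero per-entry object headers; only bucket collisions pay the usual HashMap overhead.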

The keys

Given the inequality 5^32 > 2^64, your idea of encoding the 5 letters in bits seems to be the best I can think of right now. Use 3 bits per letter and, correspondingly, a long[2] key.
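The 3-bit packing could look roughly like this (an illustrative sketch; the letter-to-code mapping is arbitrary, and a real key would also need to record the sequence length, since shorter sequences are not disambiguated here):

```java
public class ThreeBitKey {
    // Map each base to a 3-bit code; 'N' fits now that we have 3 bits per base.
    private static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            case 'N': return 4;
            default: throw new IllegalArgumentException("bad base: " + c);
        }
    }

    // Pack up to 32 bases (32 * 3 = 96 bits) into two longs.
    // The first 21 bases go into key[0] (21 * 3 = 63 bits),
    // the remaining 11 into key[1].
    public static long[] pack(String seq) {
        if (seq.length() > 32) throw new IllegalArgumentException("max length 32");
        long[] key = new long[2];
        for (int i = 0; i < seq.length(); i++) {
            int word = i / 21;
            int shift = (i % 21) * 3;
            key[word] |= ((long) code(seq.charAt(i))) << shift;
        }
        return key;
    }
}
```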


I recommend you look into the Trove4j collections API; it offers collections that hold primitives, which use less memory than their boxed wrapper counterparts.

Specifically, you should check out their TObjectIntHashMap.

Also, I wouldn't recommend storing anything as a String or char[] until JDK 9 is released, as the backing char array of a String is UTF-16 encoded, using two bytes per char. JDK 9 introduces compact strings backed by a byte array, where Latin-1 content uses only one byte per char.
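Assuming Trove 3.x is on the classpath, counting sequences with TObjectIntHashMap looks roughly like this (the countOf helper is made up for illustration; adjustOrPutValue increments an existing count or inserts the initial one):

```java
import gnu.trove.map.hash.TObjectIntHashMap;

public class TroveCounter {
    // Count occurrences of each sequence without boxing the int values.
    static int countOf(String[] reads, String query) {
        TObjectIntHashMap<String> counts = new TObjectIntHashMap<>();
        for (String seq : reads) {
            // adjust by +1 if the key is present, otherwise insert with value 1
            counts.adjustOrPutValue(seq, 1, 1);
        }
        return counts.get(query);
    }

    public static void main(String[] args) {
        System.out.println(countOf(new String[]{"ACGT", "ACGT", "TTTA"}, "ACGT")); // prints 2
    }
}
```

This avoids one Integer object per entry, but note the keys here are still Strings, so it only addresses the value side of the overhead.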

1 Comment

The problem is how a 5-letter DNA sequence of length 32 can be encoded as a primitive type. Considering long: it is a 64-bit integer, but there are 5^32 possible sequences. Also, even when converting a char sequence into a byte array, the auxiliary structures and operations such as toCharArray(), concat(), etc. may pressure the Java heap and cause the same error.

If you're working with on the order of ~10 GB of data, or at least data whose in-memory representation is ~10 GB, then you might need to think of ways to write the data you don't need at the moment to disk and load individual portions of your dataset into memory to work on them.
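One stdlib-only way to do this (an illustrative sketch, not the library linked below): hash each sequence to one of k spill files, then count each partition in memory one at a time. Every occurrence of a given sequence lands in the same file, so the per-partition counts are exact; in a real run the input would be streamed from disk too, rather than held in a List.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExternalCounter {
    // Phase 1: partition sequences across k temp files by hash.
    // Phase 2: count each partition independently; each fits in memory.
    public static Map<String, Long> countExternally(List<String> seqs, int k, Path dir)
            throws IOException {
        BufferedWriter[] writers = new BufferedWriter[k];
        for (int i = 0; i < k; i++) {
            writers[i] = Files.newBufferedWriter(dir.resolve("part" + i + ".txt"));
        }
        for (String s : seqs) {
            int p = Math.floorMod(s.hashCode(), k);
            writers[p].write(s);
            writers[p].newLine();
        }
        for (BufferedWriter w : writers) w.close();

        Map<String, Long> result = new HashMap<>(); // merged result (in a real run you'd
        for (int i = 0; i < k; i++) {               // keep only e.g. the top N per partition)
            Map<String, Long> partial = new HashMap<>();
            for (String line : Files.readAllLines(dir.resolve("part" + i + ".txt"))) {
                partial.merge(line, 1L, Long::sum);
            }
            result.putAll(partial); // keys never overlap across partitions
        }
        return result;
    }
}
```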

I had this exact problem a few years ago when I was conducting research with monte carlo simulations so I wrote a Java data structure to solve it. You can clone/fork the source here: github.com/tylerparsons/surfdep

The library supports both MySQL and SQLite as the underlying database. If you don't have either, I'd recommend SQLite as it's much quicker to set up.

Full disclosure: this is not the most efficient implementation, but it will handle very large datasets if you let it run for a few hours. I tested it successfully with matrices of up to 1 billion elements on my Windows laptop.
