2

I'm reading a file and parsing a few of the fields of each record as a reference key and another field as the reference value. These keys and values are then referenced by another process. Hence, I chose a HashMap, so that I can look up the value for each key easily.

But each of the files consists of tens of millions of records, so the HashMap throws an OutOfMemoryError. I suspect increasing the heap memory will not be a good solution if the input files grow in the future.

For similar questions on SO, most answers suggest using a database. I fear I won't be given the option to use a DB. Is there any other way to handle this problem?

EDIT: I need to do this same HashMap loading for 4 such files, and I need all four: if I don't find a matching entry for my input in the first map, I look it up in the second, then the third, and finally the fourth.

EDIT 2: The files I have sum up to around 1 GB.

EDIT 3:

034560000010000001750                                  
000234500010000100752                            
012340000010000300374

I have records like these in a file. I need to have 03456000001000000 as the key and 1750 as the value, for all the millions of records. I'll look up these keys to get the values for my other process.
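
To illustrate, this is roughly how I split each record: the layout is fixed-width, with the first 17 characters as the key and the rest as the value (variable names here are just placeholders):

    import java.util.HashMap;
    import java.util.Map;

    Map<Long, Integer> lookup = new HashMap<>();                // the map being built
    String line = "034560000010000001750";                      // sample record from above
    long key = Long.parseLong(line.substring(0, 17));           // 03456000001000000, stored numerically
    int value = Integer.parseInt(line.substring(17).trim());    // 1750
    lookup.put(key, value);
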

8 Comments
  • 1
    Do you need to take each entry into the hash map every time? Commented Dec 22, 2014 at 10:29
  • @DarshanLila: Yes. In another process I need to find whether there is a matching key-value pair from the above file, so I need all the entries there. Commented Dec 22, 2014 at 10:31
  • What are your data types (of key, value)? Is your memory limit set appropriately? (By default Java will only use 25% of your system memory.) Commented Dec 22, 2014 at 10:33
  • You need to be more specific about what the two processes do. It may be that you could load only the data from the file that is a good candidate for the second process. Are both processes of your own writing? Commented Dec 22, 2014 at 10:35
  • Do you need all records in every file? If not, you could remove unwanted records from the input files before parsing/processing. Commented Dec 22, 2014 at 10:36

7 Answers

3

Using a database will not, by itself, reduce memory cost or runtime.

However, the default HashMap may not be what you are looking for, depending on your data types. When used with boxed primitives such as Integer, Java HashMaps have a massive memory overhead. In a HashMap<Integer, Integer>, every entry uses roughly 24+16+16 bytes. Unused entries (and the hashmap keeps up to half of them unused) take 4 bytes extra. So you can roughly estimate >56 bytes per int->int entry in a Java HashMap<Integer, Integer>.

If you encode the integers as String, and we're talking maybe 6-digit numbers, that is likely 24 bytes for the underlying char[] array (16-bit characters; 12 bytes of overhead for the array, and sizes are rounded up to a multiple of 8), plus 16 bytes for the surrounding String object (maybe 24, too), for the key and the value each. So that is then around 24+40+40, i.e. about 104 bytes per entry. (Update: as your keys are 17 characters long, make this 24+62+40, i.e. about 126 bytes.)

If you used a primitive hashmap such as GNU Trove's TIntIntHashMap, it would only take 8 bytes per entry plus the unused buckets, so let's estimate 16 bytes per entry, at least 6 times less memory. (Update: for TLongIntHashMap, estimate 12 bytes per entry, or about 24 bytes with the overhead of unused buckets.)
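
A rough sketch of loading the records this way, assuming Trove 3 is on the classpath and the fixed-width layout from the question (the file path, initial size and the -1 "no entry" markers are placeholders):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import gnu.trove.map.hash.TLongIntHashMap;

    static TLongIntHashMap load(String path) throws IOException {
        // Pre-size roughly if the record count is known; -1L / -1 act as "no entry" markers.
        TLongIntHashMap map = new TLongIntHashMap(10_000_000, 0.5f, -1L, -1);
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                long key = Long.parseLong(line.substring(0, 17));         // first 17 chars = key
                int value = Integer.parseInt(line.substring(17).trim());  // remainder = value
                map.put(key, value);                                      // primitive storage, no boxing
            }
        }
        return map;
    }

map.get(key) returns the -1 "no entry" value on a miss, so a lookup that fails in the first file's map can fall through to the next one.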

You could also just store everything in a massive sorted list. This allows you to perform a fast join operation, you lose much of the overhead of unused hash buckets, and you can probably handle twice as many entries in much less time.

Oh, and if you know the range of valid keys, you can abuse a plain array as the "hashmap".

I.e. if your valid keys are 0...999999, then just use an int[1000000] as storage and write each value into the slot for its key. Don't store the key at all - it is the index into the array.
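
A tiny sketch of that trick, assuming keys that are dense in 0...999999 (which, per the comments below, is not the case here):

    int[] values = new int[1_000_000];   // index = key, cell = value
    java.util.Arrays.fill(values, -1);   // -1 marks "no entry"
    values[34560] = 1750;                // store key 34560 -> value 1750
    int v = values[34560];               // lookup is a plain array access
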

Last but not least, Java by default only uses 25% of your memory. You probably want to increase its memory limit.
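
For example, the maximum heap size is set with the standard -Xmx flag when launching the JVM (the size and class name here are placeholders):

    java -Xmx4g MyLoader
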


3 Comments

I'll give TIntIntHashMap a try. But an array as "hashmap" won't work, because the keys are not in sequence :(
<Int,Int> is not possible for me. It's <Long,Int> :(
Then use TLongIntHashMap.
2

Short answer: no. It's quite clear that you can't load your entire dataset in memory. You need a way to keep it on disk together with an index, so that you can access the relevant bits of the dataset without rescanning the whole file every time a new key is requested.

Essentially, a DBMS is a mechanism for handling (large) quantities of data: storing, retrieving, combining, filtering etc. They also provide caching for commonly used queries and responses. So anything you are going to do will be a (partial) reimplementation of what a DBMS already does.

I understand your concerns about having an external component to depend on; however, note that a DBMS is not necessarily a server daemon. There are tiny DBMSs that link into your program and keep the whole dataset in a file, as SQLite does.
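
A minimal sketch of that route using the SQLite JDBC driver (this assumes the org.xerial sqlite-jdbc jar is on the classpath; table, column and file names are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    try (Connection con = DriverManager.getConnection("jdbc:sqlite:lookup.db")) {
        con.createStatement().execute(
            "CREATE TABLE IF NOT EXISTS lookup (k INTEGER PRIMARY KEY, v INTEGER)");
        con.setAutoCommit(false);    // batch all inserts in one transaction for speed
        PreparedStatement ins = con.prepareStatement("INSERT OR REPLACE INTO lookup VALUES (?, ?)");
        // ... for each parsed record:
        ins.setLong(1, 3456000001000000L);
        ins.setInt(2, 1750);
        ins.addBatch();
        ins.executeBatch();
        con.commit();

        // Later lookups hit the on-disk index instead of a giant in-memory map.
        PreparedStatement sel = con.prepareStatement("SELECT v FROM lookup WHERE k = ?");
        sel.setLong(1, 3456000001000000L);
        try (ResultSet rs = sel.executeQuery()) {
            int value = rs.next() ? rs.getInt(1) : -1;   // -1 here stands for "not found"
        }
    }
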

Comments

1

Such large data collections should be handled with a database. Java programs are limited in memory, and the limit varies from device to device. You provided no info about your program, but please remember that if it runs on different devices, some of them may have very little RAM and will crash very quickly. A DB (whether SQL or file-based) is a must for large-data programs.

Comments

1

You have to either:

a) have enough memory to load the data into memory, or

b) read the data from disk, with an index which is either in memory or not.

Whether you use a database or not, the problem is much the same: if you don't have enough memory, you will see a dramatic drop in performance once you start randomly accessing the disk.

There are alternatives like Chronicle Map which use off-heap memory and perform well up to about double your main memory size, so you won't get an OutOfMemoryError; however, you still have the problem that you can't keep more data in memory than you have main memory.
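
A rough sketch with Chronicle Map (the builder API differs between versions, so treat this as an outline and check the Chronicle Map documentation for your release; the entry count and file name are placeholders):

    import java.io.File;
    import net.openhft.chronicle.map.ChronicleMap;

    // Off-heap map persisted to a file, so it is not limited by the Java heap.
    ChronicleMap<Long, Integer> lookup = ChronicleMap
            .of(Long.class, Integer.class)
            .entries(50_000_000L)                      // expected number of records
            .createPersistedTo(new File("lookup.dat"));

    lookup.put(3456000001000000L, 1750);
    Integer v = lookup.get(3456000001000000L);         // null if the key is absent
    lookup.close();
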

Comments

0

The memory footprint depends on how you approach the file in Java. A widely used solution is to stream the file using the Apache Commons IO LineIterator. Their recommended usage looks like this:

 import java.io.File;
 import org.apache.commons.io.FileUtils;
 import org.apache.commons.io.LineIterator;

 File file = new File("records.txt");   // placeholder path
 LineIterator it = FileUtils.lineIterator(file, "UTF-8");
 try {
   while (it.hasNext()) {
     String line = it.nextLine();
     // do something with line, e.g. parse the key/value fields
   }
 } finally {
   it.close();
 }

It's an optimized approach, but if the file is too big you can still end up with an OutOfMemoryError: the iterator streams the file line by line, but whatever you accumulate from those lines still has to fit in memory.

Comments

0

Since you write that you fear you will not be given the option to use a database, some kind of embedded DB might be the answer. If it is impossible to keep everything in memory, it must be stored somewhere else.

I believe that some kind of embedded database that uses the disk as storage might work. Examples include BerkeleyDB and Neo4j. Since both databases use a file index for fast lookups, the memory load is lower than if you keep the entire dataset in memory, but they are still fast.
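
A rough sketch with Berkeley DB Java Edition (the com.sleepycat.je API), assuming its jar is on the classpath; the directory, database name and record values are placeholders:

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import com.sleepycat.je.*;

    EnvironmentConfig envConfig = new EnvironmentConfig();
    envConfig.setAllowCreate(true);
    Environment env = new Environment(new File("lookup-env"), envConfig);

    DatabaseConfig dbConfig = new DatabaseConfig();
    dbConfig.setAllowCreate(true);
    Database db = env.openDatabase(null, "lookup", dbConfig);

    // Keys and values are raw byte arrays in BDB JE; the on-disk B-tree does the indexing.
    DatabaseEntry key = new DatabaseEntry("03456000001000000".getBytes(StandardCharsets.UTF_8));
    DatabaseEntry value = new DatabaseEntry("1750".getBytes(StandardCharsets.UTF_8));
    db.put(null, key, value);

    DatabaseEntry found = new DatabaseEntry();
    if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
        String v = new String(found.getData(), StandardCharsets.UTF_8);   // "1750"
    }

    db.close();
    env.close();
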

Comments

-1

You could try lazy loading it.

2 Comments

Except this is going through the file each time a new key is required and will eventually cause the same error if all keys are needed.
That's true... unless he is also going to 'unload' entries, but in that case a DB would for sure be easier to use and maintain.
