
I am currently developing an application that processes several files, each containing around 75,000 records (stored in binary format). When the app is run (manually, about once a month), the files together contain about 1 million records. Files are put in a folder, I click Process, and the app reads them and stores the records in a MySQL database (table_1).

The records contain information that needs to be compared to another table (table_2) containing over 700k records.

I have gone about this a few ways:

METHOD 1: Import Now, Process Later

In this method, I would import the data into the database without any processing against the other table. However, when I wanted to run a report on the collected data, the app would crash, presumably from a memory leak (about 1 GB used in total before the crash).
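
If the report crashes because the whole result set gets buffered in the JVM, one fix is to stream rows from MySQL one at a time. A minimal sketch, not my actual code: it assumes MySQL Connector/J (which streams when you use a forward-only, read-only statement with a fetch size of Integer.MIN_VALUE); the URL, credentials, and processRow body are placeholders.

    import java.sql.*;

    public class StreamingReport {
        public static void main(String[] args) throws SQLException {
            // Placeholder URL and credentials.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/mydb", "user", "pass");
                 Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
                // Connector/J streams rows one at a time with this fetch size,
                // instead of buffering the entire result set in the JVM.
                stmt.setFetchSize(Integer.MIN_VALUE);
                try (ResultSet rs = stmt.executeQuery("SELECT * FROM table_1")) {
                    while (rs.next()) {
                        processRow(rs); // only one row is held in memory here
                    }
                }
            }
        }

        static void processRow(ResultSet rs) throws SQLException {
            // aggregate one row of the report here
        }
    }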

METHOD 2: Import Now, Use MySQL to Process

This is what I would like to do, but in practice it didn't turn out so well. Here I would write the logic for finding the correlations between table_1 and table_2 in SQL. However, the MySQL result set is massive, and I couldn't get consistent output; sometimes MySQL would give up entirely.
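
One way to keep that result set out of the client entirely is to have MySQL store the correlation itself with INSERT ... SELECT, after indexing the join column on both sides. A sketch only; the table_1_matches table and the id / record_key columns are hypothetical names:

    import java.sql.*;

    public class CorrelateInDatabase {
        public static void correlate(Connection conn) throws SQLException {
            try (Statement stmt = conn.createStatement()) {
                // One-time setup: without indexes on the join column, a
                // 1M x 700k join is exactly the kind of query that gives up.
                stmt.execute("CREATE INDEX idx_t1_key ON table_1 (record_key)");
                stmt.execute("CREATE INDEX idx_t2_key ON table_2 (record_key)");

                // The correlation is computed and stored server-side;
                // no rows ever cross the wire into the JVM.
                stmt.executeUpdate(
                    "INSERT INTO table_1_matches (t1_id, t2_id) " +
                    "SELECT t1.id, t2.id " +
                    "FROM table_1 t1 " +
                    "JOIN table_2 t2 ON t1.record_key = t2.record_key");
            }
        }
    }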

METHOD 3: Import Now, Process Now

I am currently trying this method, and although the memory leak is subtle, the app still only gets through about 200,000 records before crashing. I have tried numerous forced garbage collections along the way, properly destroying objects, etc. It seems something is fighting me.
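
For the import half of this method, reusing a single PreparedStatement and flushing it in batches keeps per-record objects from accumulating. A sketch under assumed names: the fixed-width binary record layout (an 8-byte key plus a 4-byte payload) and the table_1 columns are made up for illustration.

    import java.io.*;
    import java.sql.*;

    public class BatchImporter {
        private static final int BATCH_SIZE = 1000;

        public static void importFile(Connection conn, File file)
                throws IOException, SQLException {
            conn.setAutoCommit(false); // keep transactions small and explicit
            String sql = "INSERT INTO table_1 (record_key, payload) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql);
                 DataInputStream in = new DataInputStream(
                     new BufferedInputStream(new FileInputStream(file)))) {
                int count = 0;
                while (in.available() > 0) {
                    // Hypothetical fixed-width record: 8-byte key, 4-byte payload.
                    ps.setLong(1, in.readLong());
                    ps.setInt(2, in.readInt());
                    ps.addBatch();
                    if (++count % BATCH_SIZE == 0) {
                        ps.executeBatch(); // driver releases the batched rows here
                        conn.commit();
                    }
                }
                ps.executeBatch(); // flush the final partial batch
                conn.commit();
            }
        }
    }

Adding rewriteBatchedStatements=true to the Connector/J URL additionally lets the driver collapse each batch into one multi-row INSERT.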

I am at my wits' end trying to solve the memory leak / app crash. I am no expert in Java and have yet to really deal with very large amounts of data in MySQL. Any guidance would be extremely helpful. I have put thought into these approaches:

  • Break the processing of each line into its own class, hopefully releasing any memory it uses once each line is done
  • Some sort of stored routine where, once a line is stored in the database, MySQL does the table_1 <=> table_2 computation and stores the result (see the trigger sketch after this list)
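
That second idea can be expressed as an AFTER INSERT trigger installed once over JDBC, so MySQL does the lookup as each row arrives and the JVM never holds the data. A sketch using the same hypothetical table_1_matches / record_key / id names as above:

    import java.sql.*;

    public class InstallTrigger {
        public static void install(Connection conn) throws SQLException {
            // For every row inserted into table_1, MySQL looks up matches
            // in table_2 and records them, entirely server-side.
            String trigger =
                "CREATE TRIGGER t1_correlate AFTER INSERT ON table_1 " +
                "FOR EACH ROW " +
                "INSERT INTO table_1_matches (t1_id, t2_id) " +
                "SELECT NEW.id, t2.id FROM table_2 t2 " +
                "WHERE t2.record_key = NEW.record_key";
            try (Statement stmt = conn.createStatement()) {
                stmt.execute(trigger);
            }
        }
    }

The trade-off is that every insert now pays for one indexed lookup into table_2.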

But I would like to pose the question to the many skilled Stack Overflow members, to learn how this should properly be handled.

2 Comments

  • You can run it in a profiler and see, or, to start with, just comment out the code updating the database and see if you still have a memory leak. By commenting out sections, you should find which part is the problem; then you can look at it more closely by having a unit test exercise that part 100k times in a profiler, and see what is going on. Commented Oct 15, 2011 at 2:20
  • A million records isn't that much these days, but unless you have coded to minimise memory it can easily use more than 1 GB. You can buy 16 GB for around $100, so perhaps using more memory is the simplest solution. Commented Oct 15, 2011 at 8:58

4 Answers


I concur with the answers that say "use a profiler".

But I'd just like to point out a couple of misconceptions in your question:

  • The storage leak is not due to massive data processing. It is due to a bug. The "massiveness" simply makes the symptoms more apparent.

  • Running the garbage collector won't cure a storage leak. The JVM always runs a full garbage collection immediately before it decides to give up and throw an OOME. The sketch after this list shows why forcing GC doesn't help.
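
To make that concrete, here is a minimal, self-contained demonstration (the array sizes are arbitrary): every record stays reachable from a static list, so the collector can run as often as you like and still reclaim nothing.

    import java.util.*;

    public class LeakDemo {
        // Anything reachable from a static field survives every GC cycle.
        private static final List<byte[]> PROCESSED = new ArrayList<>();

        public static void main(String[] args) {
            for (int i = 0; ; i++) {
                PROCESSED.add(new byte[10_000]); // "remember" each record
                if (i % 1_000 == 0) {
                    System.gc(); // runs, but reclaims nothing still referenced
                    System.out.println(i + " records, free memory = "
                            + Runtime.getRuntime().freeMemory());
                }
            }
            // Ends with java.lang.OutOfMemoryError: Java heap space, thrown
            // only after the JVM's own full collection failed to free space.
        }
    }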


It is difficult to give advice on what might actually be causing the storage leak without more information on what you are trying to do and how you are doing it.


1 Comment

I ended up using the profiler @ed-staub suggested below, VisualVM. What I noticed was that the JDBC driver I was using to connect to MySQL was accumulating bits of data somewhere. I had also gone into development with the wrong mindset of pumping everything through one connection. I split the connections up, threaded the record processing, and cached the referenced table in memory (saved ~2 seconds per call).
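
If a JDBC driver seems to be accumulating data, unclosed Statements and ResultSets are the usual suspects: each one pins driver-side buffers until it is closed. A sketch of the per-lookup pattern with try-with-resources (table and column names hypothetical):

    import java.sql.*;

    public class Table2Lookup {
        private final Connection conn;

        public Table2Lookup(Connection conn) {
            this.conn = conn;
        }

        // try-with-resources guarantees the statement and result set are
        // closed on every call, releasing whatever the driver holds for them.
        public boolean existsInTable2(long key) throws SQLException {
            String sql = "SELECT 1 FROM table_2 WHERE record_key = ? LIMIT 1";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, key);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next();
                }
            }
        }
    }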

The learning curve for a profiler like VisualVM is pretty small. With luck, you'll have an answer - at least a very big clue - within an hour or so.



You handle this situation properly by either:

  • generating a heap dump when the app crashes and analyzing it in a good memory profiler (the JVM flags sketched below are one way to get the dump automatically)
  • hooking the running app up to a good memory profiler and looking at the heap
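
For the first option, HotSpot can write the dump for you at the moment the app dies; the flags are standard, while the dump path and jar name here are placeholders:

    java -Xmx1g \
         -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/tmp/import-crash.hprof \
         -jar import-app.jar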

I personally prefer YourKit (yjp), but there are some decent free tools as well (e.g. jvisualvm and NetBeans).



Without knowing too much about what you're doing: if you're running out of memory, there's likely some point where you're storing everything in the JVM, but you should be able to do a data processing task like this without the severe memory problems you're experiencing. In the past, I've seen data processing pipelines that run out of memory because one class reads everything out of the db, wraps it all up in a nice collection, and then passes it off to another class, which of course requires all of the data to be in memory simultaneously. Frameworks are good at hiding this sort of thing.
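
To illustrate the difference, here is a sketch of the push-based alternative: the reader hands each row to a callback as it streams past, instead of returning a collection (the query and column name are hypothetical):

    import java.sql.*;
    import java.util.function.Consumer;

    public class RecordPipeline {
        // Push each row to the handler as it is read; nothing forces the
        // whole table into the heap the way a returned List<Record> would.
        public static void forEachRecord(Connection conn, Consumer<Long> handler)
                throws SQLException {
            try (Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
                stmt.setFetchSize(Integer.MIN_VALUE); // stream, don't buffer
                try (ResultSet rs = stmt.executeQuery(
                         "SELECT record_key FROM table_1")) {
                    while (rs.next()) {
                        handler.accept(rs.getLong(1));
                    }
                }
            }
        }
    }

A caller then aggregates in place, e.g. forEachRecord(conn, key -> counts.merge(key, 1, Integer::sum)), so only the aggregate ever lives in memory.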

Heap dumps / digging with VisualVM haven't been terribly helpful for me, as the details I'm looking for are often hidden - e.g. if you've got a ton of memory filled with maps of strings, it doesn't really help to be told that Strings are the largest component of your memory usage; you really need to know who owns them.

Can you post more detail about the actual problem you're trying to solve?

