
I want to read a huge CSV file with Java. It contains 75,000,000 lines. The problem is that even though I am using the maximum `-Xms` and `-Xmx` limits, I get `java.lang.OutOfMemoryError: GC overhead limit exceeded`, and it points to this line as the cause:

String[][] matrix = new String[counterRow][counterCol];

I did some tests and saw that I can read 15,000,000 lines fine. Therefore I started to use this sort of code:

String csvFile = "myfile.csv";
List<String[]> rowList = new ArrayList<>();
String line = "";
String csvSplitBy = ",";
BufferedReader br = null;
try {
    int counterRow = 0, counterCol = 12, id = 0;
    br = new BufferedReader(new FileReader(csvFile));
    while ((line = br.readLine()) != null) {
        String[] object = line.split(csvSplitBy);
        rowList.add(object);
        counterRow++;
        if (counterRow % 15000000 == 0) {
            String[][] matrix = new String[counterRow][counterCol];
            // ... do processes ...
            SaveAsCSV(matrix, id);
            counterRow = 0; id++; rowList.clear();
        }
    }
}
...

Here, it writes the first 15,000,000 lines very well, but on the second pass it gives the same error again, even though counterRow is again 15,000,000.

In summary, I need to read a CSV file with 75,000,000 rows (approx. 5 GB) in Java, do some processing on its records, and save the results to a new CSV file or files.

How can I solve this problem?

Thanks

EDIT: I am also using rowList.clear(), guys, forgot to specify it here. Sorry.

EDIT 2: My friends, I don't need to put the whole file in memory. How can I read it part by part? That is actually what I tried to do with if (counterRow % 15000000 == 0). What is the correct way to do it?

  • That's a huge amount of data to have in memory - why don't you try writing to a database, then querying it? Commented Aug 7, 2014 at 14:25
  • You definitely can't bring the whole file into memory. Can you process the file in batches/parts? Commented Aug 7, 2014 at 14:25
  • Memory-mapped files? javarevisited.blogspot.de/2012/01/… Commented Aug 7, 2014 at 14:26
  • If your file is 5 GB and you want to keep it in memory, you'll need at least 5 GB of RAM, I think. Huge ^^ Commented Aug 7, 2014 at 14:27
  • Streaming is your best friend here Commented Aug 7, 2014 at 14:27

4 Answers


You can read the lines individually, then do your processing as you go, until you have read the entire file:

String encoding = "UTF-8";
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
String line;
while ((line = br.readLine()) != null) {
   // process the line.
}
br.close();

This should not go fubar; just make sure you process each line immediately and don't store it in variables outside your loop.
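To make that concrete, here is a minimal, self-contained sketch of the read-process-write loop; the file names are hypothetical and the pass-through `String.join` step is a placeholder for the asker's actual processing:

```java
import java.io.*;

public class StreamCsv {
    // Reads the input one line at a time and writes each processed line out
    // immediately, so only a single line is ever held on the heap.
    static void process(String inPath, String outPath) throws IOException {
        try (BufferedReader br = new BufferedReader(
                 new InputStreamReader(new FileInputStream(inPath), "UTF-8"));
             BufferedWriter bw = new BufferedWriter(new FileWriter(outPath))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] fields = line.split(",");
                // ... do processes on fields here ...
                bw.write(String.join(",", fields)); // placeholder: write the row back out
                bw.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Create a tiny sample input so the sketch runs standalone
        try (PrintWriter pw = new PrintWriter("myfile.csv")) {
            pw.println("a,b,c");
            pw.println("1,2,3");
        }
        process("myfile.csv", "out.csv");
    }
}
```

The try-with-resources blocks close both streams even if processing throws, which matters when a 5 GB run can fail partway through.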


2 Comments

I am gonna try this soon and let u know the result my friend. thanks
IT WORKS VERY WELL MY FRIEND! THANKS A LOT!

The issue is not that you do not have enough memory. "GC overhead limit exceeded" means that garbage collection is taking too long. You cannot fix this by allocating more memory, only by using -XX:-UseGCOverheadLimit - that is, if you really want that much data in memory.

See e.g. How to solve "GC overhead limit exceeded" using maven jvmArg?

Or use Peter Lawrey's memory-mapped HugeCollections (http://vanillajava.blogspot.be/2011/08/added-memory-mapped-support-to.html?q=huge+collections): it writes to disk when memory is full.

1 Comment

Ah nice point. I am using rowList.clear() also, forgot to copy/paste here!

Maybe you forgot to call

rowList.clear();

after

counterRow=0; id++;

1 Comment

Ah nice point. I am using rowList.clear() also, forgot to copy/paste here!

The “java.lang.OutOfMemoryError: GC overhead limit exceeded” error will be displayed when your application has exhausted pretty much all the available memory and GC has repeatedly failed to clean it.

The solution recommended above - specifying -XX:-UseGCOverheadLimit - is something I strongly suggest not doing. Instead of fixing the problem, you are just postponing the inevitable: the application is running out of memory and needs to be fixed. Specifying this option merely masks the original “java.lang.OutOfMemoryError: GC overhead limit exceeded” error with the more familiar message “java.lang.OutOfMemoryError: Java heap space”.

Possible solutions boil down to two reasonable alternatives in your case: either increase the heap space (the -Xmx parameter) or reduce the heap consumption of your code by reading the file in smaller batches.
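As a sketch of the batching alternative (the file names, batch size, and the `saveAsCsv` helper below are stand-ins for the asker's own `SaveAsCSV` and processing):

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class BatchCsv {
    // Accumulate a bounded number of rows, flush them to their own output
    // file, then clear the list so the rows become garbage before reading on.
    static void processInBatches(String inPath, int batchSize) throws IOException {
        List<String[]> batch = new ArrayList<>();
        int id = 0;
        try (BufferedReader br = new BufferedReader(new FileReader(inPath))) {
            String line;
            while ((line = br.readLine()) != null) {
                batch.add(line.split(","));
                if (batch.size() == batchSize) {
                    saveAsCsv(batch, id++);
                    batch.clear(); // release the rows before reading more
                }
            }
            if (!batch.isEmpty()) { // don't drop the final partial batch
                saveAsCsv(batch, id);
            }
        }
    }

    // Stand-in for the asker's SaveAsCSV: one output file per batch
    static void saveAsCsv(List<String[]> rows, int id) throws IOException {
        try (BufferedWriter bw = new BufferedWriter(new FileWriter("out-" + id + ".csv"))) {
            for (String[] row : rows) {
                bw.write(String.join(",", row));
                bw.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Tiny sample input and batch size just to demo; tune the batch to your heap
        try (PrintWriter pw = new PrintWriter("myfile.csv")) {
            for (int i = 0; i < 5; i++) pw.println(i + ",x");
        }
        processInBatches("myfile.csv", 2);
    }
}
```

Note the final `if (!batch.isEmpty())` flush: without it, any rows left over after the last full batch (75,000,000 is not always a multiple of the batch size) would be silently lost - the same class of bug as the `% 15000000` check in the question.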

