1

I have a program that reads a large list of sequences from a file and does a calculation among all of the pairs in that list. It then stores all of these calculations into a hashset. When running this program about halfway through, I get a GC overhead limit error.

I realize this is because the garbage collector is using up 98% of the computation time and is unable to recover even 2% of the heap. Here is the code I have:

ArrayList<String> c = loadSequences("file.txt"); // Loads 60 char DNA sequences
HashSet<DNAPair,Double> LSA = new HashSet<DNAPair,Double>(); 
for(int i = 0; i < c.size(); i++) {
    for(int j = i+1; j < c.size(); j++) {
        LSA.put(new DNAPair(c.get(i),c.get(j)),localSeqAlignmentSimilarity(c.get(i),c.get(j)));
    }
}

And here's the code for the actual method:

public static double localSeqAlignmentSimilarity(String s1, String s2) {
    s1 = " " + s1;
    s2 = " " + s2;
    int max = 0,h = 0,maxI = 0,maxJ = 0;

    int[][] score = new int[61][61];
    int[][] pointers = new int[61][61];

    for(int i = 1; i < s1.length(); i++) {
        pointers[i][0] = 2;
    }
    for(int i = 1; i < s2.length(); i++) {
        pointers[0][i] = 1;
    }

    boolean inGap = false;
    for(int i = 1; i < s1.length(); i++) {
        for(int j = 1; j < s2.length();  j++) {
            h = -99;
            if(score[i-1][j-1] + match(s1.charAt(i),s2.charAt(j)) > h) {
                h = score[i-1][j-1] + match(s1.charAt(i),s2.charAt(j));
                pointers[i][j] = 3;
                inGap = false;
            } 
            if(!inGap) {
                if(score[i-1][j] + GAPPENALTY > h) {
                    h = score[i-1][j] + GAPPENALTY;
                    pointers[i][j] = 2;
                    inGap = true;
                } 
                if(score[i][j-1] + GAPPENALTY > h) {
                    h = score[i][j-1] + GAPPENALTY;
                    pointers[i][j] = 1;
                    inGap = true;
                }
            } else {
                if(score[i-1][j] + GAPEXTENSION > h) {
                    h = score[i-1][j] + GAPEXTENSION;
                    pointers[i][j] = 2;
                    inGap = true;
                } 
                if(score[i][j-1] + GAPEXTENSION > h) {
                    h = score[i][j-1] + GAPEXTENSION;
                    pointers[i][j] = 1;
                    inGap = true;
                }
            }

            if(0 > h) h = 0;

            score[i][j] = h;
            if(h >= max) {
                max = h;
                maxI = i;
                maxJ = j;
            }
        }
    }

    double matches = 0;
    String o1 = "",  o2 = "";
    while(!(maxI == 0 && maxJ == 0)) {
        if(pointers[maxI][maxJ] == 3) {
            o1 += s1.charAt(maxI);
            o2 += s2.charAt(maxJ);
            maxI--;
            maxJ--;
        } else if(pointers[maxI][maxJ] == 2) {
            o1 += s1.charAt(maxI);
            o2 += "_";
            maxI--;
        } else if(pointers[maxI][maxJ] == 1) {
            o1 += "_";
            o2 += s2.charAt(maxJ);
            maxJ--;
        }
    }

    StringBuilder a = new StringBuilder(o1);
    b = new StringBuilder(o2);
    o1 = a.reverse().toString();
    o2 = b.reverse().toString();
    a.setLength(0);
    b.setLength(0);

    for(int i = 0; i < Math.min(o1.length(), o2.length()); i++) {
        if(o1.charAt(i) == o2.charAt(i)) matches++;
    }
    return matches/Math.min(o1.length(), o2.length());
}

I thought that this was because of all the variables I declare inside the method (the two int arrays and the stringbuilders etc.) creating more and more objects every time the method is run so I changed them all to static fields and cleared them everytime (ex. Arrays.fill(score,0);) instead of creating a new object.

However this didn't help at all and I still got the same error.

Could it be that the hashset that stores all of the calculations is getting too big and is unable to be stored by java? I'm not getting an out of heap space error so it seems kind of strange.

I also changed the command line argument to give more space to the JVM but that didn't seem to help.

Any insight on this problem would be helpful. Thanks!

3
  • How many sequences are you reading from the file? Commented Aug 19, 2013 at 22:40
  • 1
    Does DNAPair override hashCode? You are creating an n^2 set that is going to get big very quickly. You may have to swap the set to disk. Commented Aug 19, 2013 at 22:49
  • The particular file I'm using has 73657 sequences. Commented Aug 19, 2013 at 22:56

2 Answers 2

1

This is a problem, if c.size() is 73657 and the sequences are unique:

HashSet<DNAPair,Double> LSA = new HashSet<DNAPair,Double>(); 
 for(int i = 0; i < c.size(); i++) {
   for(int j = i+1; j < c.size(); j++) {
      LSA.put(...);
   }
 }

Assuming these are unique sequences, you're basically adding an element to LSA for each pair. You mention you have 70k sequences, so you are going to have 70k * 70k = ~5 billion pairs, each of which will take at a MINIMUM 4 bytes to store, meaning you'd need 20+ GB allocated at a minimum for this to be feasible.

Sign up to request clarification or add additional context in comments.

1 Comment

That makes sense, what is an alternative to store these pairs? This part of the code simply precomputes the pairs for later use within the program is there a better way I can store the data?
0

Yes, it could indeed be that the amount of data is simply too big to store in memory. I would start by trying to actually profile the program's memory usage while it is running using something like JConsole or reading from the MemoryMXBean generally from within your program.

In case it's useful, I wrote the small Classmexer agent which allows you to query the actual memory usage of a Java object (and sub-objects) from within your Java program.

Incidentally, it's usually not beneficial to try and 'fool' or pre-empt the JVM's memory management system by doing things like making objects static that really shouldn't naturally be static.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.