0

I have an application which forms all the possible pairs and then compare the pairs, but when I run the applicaion, it gave me exception:OutOfMemoryError:Java heap space when running the code. I tried -Xmx1500m, but the exception just kept coming. the code for generating the pairs are as following

File file = ...;

final Map<Pair, Collection<Integer>> lineNumbersByPair = new HashMap<Pair, Collection<Integer>>();

/*
 * Step 1: Read in the lines, one by one.
 */
Reader reader = new FileReader(file);
try {
    BufferedReader bufferedReader = new BufferedReader(reader);
    try {
        String line;

        int lineNumber = 0;
        while ((line = bufferedReader.readLine()) != null) {
            lineNumber++;

            String[] tokens = line.split("\\s+");
            int[] values = new int[tokens.length];

            for (int i = 0; i < tokens.length; i++) {
                values[i] = Integer.parseInt(tokens[i]);
            }

            for (int i = 0; i < values.length; i++) {
                for (int j = i + 1; j < values.length; j++) {
                    Pair pair = new Pair(values[i], values[j]);

                    Collection<Integer> lineNumbers;
                    if (lineNumbersByPair.containsKey(pair)) {
                        lineNumbers = lineNumbersByPair.get(pair);
                    } else {
                        lineNumbers = new HashSet<Integer>();
                        lineNumbersByPair.put(pair, lineNumbers);
                    }
                    lineNumbers.add(lineNumber);
                }
            }
        }
    } finally {
        bufferedReader.close();
    }
} finally {
    reader.close();
}

/*
 * Step 2: Identify the unique pairs. Sort them according to how many lines they appear on (most number of lines to least number of lines).
 */
List<Pair> pairs = new ArrayList<Pair>(lineNumbersByPair.keySet());
Collections.sort(
        pairs,
        new Comparator<Pair>() {
            @Override
            public int compare(Pair pair1, Pair pair2) {
                Integer count1 = lineNumbersByPair.get(pair1).size();
                Integer count2 = lineNumbersByPair.get(pair2).size();
                return count1.compareTo(count2);
            }
        }
    );
Collections.reverse(pairs);

/*
 * Step 3: Print the pairs and their line numbers.
 */
for (Pair pair : pairs) {
    Collection<Integer> lineNumbers = lineNumbersByPair.get(pair);
    if (lineNumbers.size() > 1) {
        System.out.println(pair + " appears on the following lines: " + lineNumbers);
    }
}

I am reading a file around 15mb and it contains like 20000lines of single numbers, and there are around 40 numbers per line, it forms all possible pairs of each line. anyone has any idea about how to solve this? thanks

4
  • 1
    are you getting proper results for smaller files? Any successful output so far? Commented Oct 13, 2010 at 4:51
  • yes, it works fine on smaller files, and the output is all good Commented Oct 13, 2010 at 4:52
  • starcaller, as J-16 SDiZ commented and as I have posted - this problem as it stands is going to be a memory and time destroyer. What is your use case or problem you're really trying to solve? Commented Oct 13, 2010 at 5:05
  • @kuropenguin, I'm trying to get all the pairs that appears more than 2% among all the pairs from each line. for example, there is 10000lines, so i need to find pairs that appears in more than 200 lines. Commented Oct 13, 2010 at 5:11

2 Answers 2

1

My math may be off, but this is probably why you're still running out of space.

So 40 numbers per line, 20000 lines = 800000 numbers.

800000 C 2 = 319999600000 combinations of numbers.

At 4 bytes an int, and with Pair<int, int>, each Pair is at least 8 bytes, and then you add it to your data structure.

8 bytes * 319999600000 = 2+ terabytes.

After re-reading your problem, each line is separate from the next.

40 numbers per line => 40 C 2 = 780 combinations per line * 20000 lines = 15600000 possible unique pairs * 8 bytes per pair = 119 MB purely for int in the worst case. Add to that the memory taken up by references since Java does not allow primitive types in a Collection.

But after taking another look at your program, I have some suggestions:

Why are you mapping a Pair to a Set<Integer>?

If you are interested in just the number of occurrences of each Pair, you do not need to keep track of the line numbers the pairs occur on - you only want to store the number of times it has shown up.

So in this case you want to map Pair to Integer. This could reduce the amount of memory you require.

Do you care about ordering of the pair?

Your for loop seems to indicate you do not care about ordering, i.e. the pair (30,45) is the same as the pair (45,30). If so, you should create your Pair based on relative ordering in the pair. Perhaps making a Pair based on the smallest value first, so that every time you encountered two integers m and n, you would always create the pair as Pair(m, n). Also see the next section about hashCode() and equals().

Have you implemented Pair's int hashCode() and boolean equals(Object) methods?

This could be the difference between an actual working program and a broken one.

In your case, you want Pair objects to test for logical equality as it is a custom class, so you will have to override and implement your own equals(Object) method. You will also want to override hashCode() as you must always do so when overriding equals().

This is detailed in the excellent Effective Java, and here's a sample of the chapter discussing this: http://java.sun.com/developer/Books/effectivejava/Chapter3.pdf

Sign up to request clarification or add additional context in comments.

3 Comments

is there a better strcture for storing the pairs?
Even worst, the time just listing the pair ... assume you can get 10k pair each second == 319999600000 / 10000 / 60 / 60 / 24 = more than one year! ____ you should rethink your problem, not the data structure.
if a Pair has two Integer fields, it should take 40 bytes on 64 bit systems or 24 bytes on 32 bit systems, IIRC. so 595MB or 357MB, plus a bit more for arraylist overhead.
0

When data became too big to fit memory, the only way is use extended memory (HDD). Here you can partition and store on disk, load small part each onto memory and search.

Or you should use a algobrithm that use less memory and more processor.
1.Search the file, search all num, and create a relative 2-d matrix or something like below.

  1 2 3 4 ...
1 0 1 0 0
2 0 1 0 0
3 0 0 0 0
...


2.You can sort on this matrix.
3.One by one pair, search the file to get line num contain both numbers in pair.
Sorry because my bad english.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.