12

Do you know of any time efficient way to remove duplicated values from a very big integer array using Java? The size of the array depends on the logged in user, but will always exceed 1500000 unsorted values with some duplicates. Every integer contains a number between 100000 and 9999999.

I tried converting it to a List, but the heap on my server doesn't allow this amount of data(my ISP has restricted it). And a regular for loop within a for loop takes over 5 minutes to calculate.

The size of the array without the duplicates is the one I will store in my database.

Help would be appreciated!

9 Answers 9

37

You could perhaps use a bit set? I don't know how efficient Java's BitSet is. But 9999999 possible values would only take 9999999 / 8 = 1250000 bytes = just over 1Mb. As you walk the array of values, set the corresponding bit to true. Then you can walk over the bit set and output the corresponding value whenever you find a bit set to true.

1Mb will fit in a CPU cache, so this could be quite efficient depending on the bit set implementation.

This also has the side-effect of sorting the data too.

And... this is an O(n) algorithm since it requires a single pass over the input data, the set operations are O(1) (for an array-based set like this), and the output pass is also O(m) where m is the number of unique values and, by definition, must be <= n.

Sign up to request clarification or add additional context in comments.

1 Comment

Clever answers like these are the reason why I come to StackOverflow
3

I would make a hashset where I store all values contained in the list, before i start adding items to the list. Then just check so that the hashset doesn't contain the value you want to add.

3 Comments

"I tried converting it to a List, but the heap on my server doesn't allow this amount of data" - that probably rules out Sets as well.
In my mind a list is a bit more wastefull with memory than a hashset, for large datasets. But i might be wrong. =/
That largely depends on the list implementation. I believe ArrayList is more memory-efficient than HashSet, but I might be wrong too :-)
3
Set<Integer> set = new HashSet<Integer>();
Collections.addAll(set, array);

you will just need an array of Integer[] instead of int[].

2 Comments

"I tried converting it to a List, but the heap on my server doesn't allow this amount of data" - that probably rules out Sets as well.
Yes, that's more to the point. @user435140 note that this will only work if your array holds Integer's, not primitive int's.
2

You can try sorting the array first:

int arr[] = yourarray;
Arrays.sort(arr);
// then iterate arr and remove duplicates

10 Comments

@Bozho he could iterate the array and count unique values. Apparently is the only thing he needs to do ...The size of the array without the duplicates is the one I will store in my database...
By sorting first, you can then do a final traversal of the array and only keep one of each unique value. That should give a complexity of O(n log n) as opposed to O(n^2) for the double loop mentioned.
Assuming you have sufficient resources to sort the thing in the first place!
@Danny, Arrays.sort(...) does not use more space: it sorts "in place".
@Bart K - this depends on your implementation, but the JDK doesn't guarantee an in-place sort. Many actually use a form of mergesort which requires O(n) extra space.
|
2

The truly desperate could write the array to disk and fork off sort | uniq | wc -l <infile.txt and capture the output. This would be needed if memory was still too tight or the domain space of integers got larger. I don't like this (is he even running unix!) but my point is that there are many ways to accomplish the task.

Another observation is that the minimum value is 100,000. So we could subtract 100,000 from the maximum value of 9,999,999, reducing the domain space and thus saving some memory. Perhaps 100k/8 bits is peanuts in the scheme of things, but it's essentially free to do it.

Comments

1
int[] a;
Arrays.sort(a);
int j = 0;
for (int i = 1; i < a.length; ++i) {
  if (a[i] != a[j]) {
    ++j;
    a[j] = a[i];
  }
}
// now store the elements from 0 to j (inclusive - i think)

1 Comment

If the result doesn't need to be sorted you can copy values from the "start" (which increments when copied) to reduce the number of copies. (one per duplicate instead of one per element)
0

Perhaps you could make a handful of passes over the data? For example, if you made ten passes over the data and applied one of the set suggestions above to a smaller subset of the data (say, when value mod pass# == 0). Thus:

for (int i = 0 to 9) {
  set = new Set()
  for (each entry in the data set) {
    if (entry % i == 0) {
      set.add(entry)
    }
  }
  output set
}

This way you will trade off time for memory (increase the number of passes for less memory/more time and vice-versa).

Comments

0

Maybe a hash set that works with primitives instead of objects will do the job? There are free implementations (havn't used them before but maybe it works):

http://trove4j.sourceforge.net/

http://trove4j.sourceforge.net/javadocs/gnu/trove/TIntHashSet.html

Would then look like:

int[] newArray = new TIntHashSet(yourArray).toArray();

Comments

0

If you are sure, that integers have resonable small values (e.g. always more than zero and less than 1000 or 10000), you can try a trick like this:

    final int MAX = 100; 
    int[] arrayWithRepeats = {99, 0, 10, 99, 0, 11, 99};

    //we are counting here integers with the same value
    int [] arrayOfValues = new int[MAX+1];
    int countOfUniqueIntegers = 0;
    for(int i : arrayWithRepeats) {
        if(arrayOfValues[i] == 0) {
            countOfUniqueIntegers++;
        }
        arrayOfValues[i]++;
    }

    // you can use arrayOfValues (smaller) or convert it
    // to table of unique values (more usable)

    int[] arrayOfUniqueValues = new int[countOfUniqueIntegers];
    int index = 0;
    for(int i = 0; i<arrayOfValues.length; i++) {
        if(arrayOfValues[i] != 0) {
            arrayOfUniqueValues[index] = i;
            index++;
        }
    }

    //and now arrayOfUniqueValues is even sorted
    System.out.println( Arrays.toString(arrayOfUniqueValues) );

Output: [0, 10, 11, 99]

3 Comments

This is essentially the same as my bit set suggestion, except you're using 32 bits per entry instead of 1, so memory becomes an issue quite quickly. Also, the OP said that the values will be up to 9999999.
Since "Every integer contains a number between 100000 and 9999999" this will not work.
You are right. And good idea is to change arrayOfValues form int[] to BitSet as Danny's idea.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.