Counting each distinct array occurrence in a list of arrays with duplicates

Question

PROBLEM

I have a list of arrays and I want to count the occurrences of duplicates.

For example, given this:

{{1,2,3},
 {1,0,3},
 {1,2,3},
 {5,2,6},
 {5,2,6},
 {5,2,6}}

I want a map (or any suitable collection) like this:

{ {1,2,3} -> 2,
  {1,0,3} -> 1,
  {5,2,6} -> 3 }

I can even discard the array values; I'm only interested in the counts (e.g. 2, 1 and 3 here).

MY SOLUTION

I use the following algorithm:

First, hash each array and check whether the hash is in a HashMap<Integer, ArrayList<int[]>> named distinctHash, where the key is the hash and the value is an ArrayList, named rowList, containing the distinct arrays that share this hash (to handle collisions).
If the hash is not in distinctHash, put the array with the value 1 into another HashMap<int[], Long>, named distinctElements, that counts the occurrences of each array.
If the hash is in distinctHash, check whether an identical array is already in rowList. If it is, increment the value in distinctElements associated with the identical array found in rowList. (Using the new array as the key would create another key, since the two references are different.) If it is not, add the new array to rowList and put it in distinctElements with the value 1.

Here is the code. The returned boolean tells whether a new distinct array was found. I apply this function sequentially to all of my arrays:

    HashMap<int[], Long> distinctElements;
    HashMap<Integer, ArrayList<int[]>> distinctHash;

    private boolean addRow(int[] row) {
        // The post does not show how `hash` is computed. computeHash is a
        // placeholder for the Guava murmur3_128 hash mentioned below.
        int hash = computeHash(row);

        if (distinctHash.containsKey(hash)) {
            for (int[] previousRow: distinctHash.get(hash)) {
                if (Arrays.equals(previousRow, row)) {
                    // Increment the count of the matching stored array,
                    // not of the first array in the bucket.
                    distinctElements.put(
                            previousRow,
                            distinctElements.get(previousRow) + 1
                    );
                    return false;
                }
            }
            // Hash collision with a different array: record it as new.
            distinctElements.put(row, 1L);

            ArrayList<int[]> rowList = distinctHash.get(hash);
            rowList.add(row);

            return true;

        } else {
            distinctElements.put(row, 1L);

            ArrayList<int[]> newValue = new ArrayList<>();
            newValue.add(row);
            distinctHash.put(hash, newValue);

            return true;
        }
    }

QUESTION

The problem is that my algorithm is too slow for my needs (40s for 5,000,000 arrays, and 2h-3h for 20,000,000 arrays). Profiling with NetBeans told me that the hashing takes 70% of runtime (using Google Guava murmur3_128 hash function).

Is there another algorithm that could be faster? As I said, I'm not interested in the array values, only in the number of their occurrences. I am ready to sacrifice precision for speed, so a probabilistic algorithm is fine.

What do you know about the structure of the arrays? Are they always of length 3 as in the example? Do they always contain numbers, or can they contain anything?
– Radu Jakab, Commented Oct 3, 2018 at 16:35
The arrays are of a fixed size (around 10 most of the time) determined in another part of my code. They contain only ints.
– Baptiste Merliot, Commented Oct 3, 2018 at 16:45
Did you try the simplest approach of creating a map where the key is the integer array and the value is a single integer? You would need to use List<Integer> as the key, and Arrays.asList for insertion.
– Markus, Commented Oct 3, 2018 at 16:45
@Markus I tried this simple approach with int[] before remembering that the equals() method compares references (hence the Arrays.equals method). I will definitely try that with Lists.
– Baptiste Merliot, Commented Oct 3, 2018 at 18:57

Andreas · Accepted Answer · 2018-10-03 17:07:55Z

4

Wrap the int[] in a class that implements equals and hashCode, then build a Map from the wrapper class to the instance count.

import java.util.Arrays;

class IntArray {
    private final int[] array;
    public IntArray(int[] array) {
        this.array = array;
    }
    @Override
    public int hashCode() {
        return Arrays.hashCode(this.array);
    }
    @Override
    public boolean equals(Object obj) {
        return (obj instanceof IntArray && Arrays.equals(this.array, ((IntArray) obj).array));
    }
    @Override
    public String toString() {
        return Arrays.toString(this.array);
    }
}

Test

int[][] input = {{1,2,3},
                 {1,0,3},
                 {1,2,3},
                 {5,2,6},
                 {5,2,6},
                 {5,2,6}};
Map<IntArray, Long> map = Arrays.stream(input).map(IntArray::new)
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
map.entrySet().forEach(System.out::println);

Output

[1, 2, 3]=2
[1, 0, 3]=1
[5, 2, 6]=3
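
For a very large number of arrays, the same idea can also be written as a plain loop instead of a stream. The following is only a sketch, not a measured improvement: the names RowCounter, Key and countRows are made up for this example, and Key simply repeats the wrapper idea so that the block compiles on its own. It relies on Map.merge to insert or increment a count with a single lookup.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class RowCounter {

    /** Hypothetical wrapper giving int[] value-based equals/hashCode. */
    public static final class Key {
        private final int[] array;

        public Key(int[] array) {
            this.array = array;
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(array);
        }

        @Override
        public boolean equals(Object obj) {
            return obj instanceof Key && Arrays.equals(array, ((Key) obj).array);
        }

        @Override
        public String toString() {
            return Arrays.toString(array);
        }
    }

    /** Counts how many times each distinct row occurs. */
    public static Map<Key, Long> countRows(int[][] rows) {
        Map<Key, Long> counts = new HashMap<>();
        for (int[] row : rows) {
            // Inserts 1 for a new key, otherwise adds 1 to the existing count.
            counts.merge(new Key(row), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] input = {{1, 2, 3}, {1, 0, 3}, {1, 2, 3},
                         {5, 2, 6}, {5, 2, 6}, {5, 2, 6}};
        // HashMap iteration order is unspecified, so the print order may vary.
        countRows(input).forEach((key, count) -> System.out.println(key + "=" + count));
    }
}
```

The wrapper keeps a reference to the original array rather than a copy, so the arrays must not be modified after they have been used as keys.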

Note: The above solution is faster and uses less memory than the solution by Ravindra Ranwala, but it requires an extra class, so which one is better is debatable.

For smaller arrays, use the simpler solution below by Ravindra Ranwala.
For larger arrays, the above solution is likely better.

 Map<List<Integer>, Long> map = Stream.of(input)
         .map(a -> Arrays.stream(a).boxed().collect(Collectors.toList()))
         .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

edited Oct 3, 2018 at 17:07

answered Oct 3, 2018 at 16:45

Andreas


1 Comment

Baptiste Merliot Over a year ago

This solution is twice as fast as mine. Thanks!

Ravindra Ranwala · Accepted Answer · 2018-10-03 16:50:58Z

3

You can do it like this:

Map<List<Integer>, Long> result = Stream.of(source)
        .map(a -> Arrays.stream(a).boxed().collect(Collectors.toList()))
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

And here's the output:

{[1, 2, 3]=2, [1, 0, 3]=1, [5, 2, 6]=3}
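
The reason a List<Integer> works as a key where an int[] does not is that List.equals and List.hashCode are defined on the elements, while arrays inherit the identity-based versions from Object. A small sketch of the difference (the class and method names are made up for this illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyEqualityDemo {

    /** Uses the arrays themselves as keys: equal contents are NOT merged. */
    public static int distinctKeysWithArrays(int[][] rows) {
        Map<int[], Long> counts = new HashMap<>();
        for (int[] row : rows) {
            counts.merge(row, 1L, Long::sum);
        }
        return counts.size();
    }

    /** Uses boxed lists as keys: rows with equal contents share one key. */
    public static int distinctKeysWithLists(int[][] rows) {
        Map<List<Integer>, Long> counts = new HashMap<>();
        for (int[] row : rows) {
            List<Integer> key = new ArrayList<>(row.length);
            for (int value : row) {
                key.add(value); // boxes each int into an Integer
            }
            counts.merge(key, 1L, Long::sum);
        }
        return counts.size();
    }
}
```

For two separate arrays that both contain 1, 2, 3, the array-keyed map ends up with two entries and the list-keyed map with one. The price of the list version is one Integer per element, which is a plausible source of the extra time and memory on large inputs.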

answered Oct 3, 2018 at 16:50

Ravindra Ranwala


1 Comment

Baptiste Merliot Over a year ago

After some tests, this solution is slower than mine. It takes 50s when mine takes 40s for the same sample.

OmG · Accepted Answer · 2018-10-03 16:52:21Z

0

If all duplicates of an array have their elements in the same order, and the arrays are not very long, you can map each array to a single integer and then reuse the last part of your method. Although this reduces the time spent hashing, it relies on assumptions that might not hold in your case.
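
One possible reading of this idea, as a sketch only: if every array has the same length and every value is known to be non-negative and to fit in a small number of bits, each array can be packed into a single long and counted in a Map<Long, Long>. The bit width, the class name PackedCounter and both method names are assumptions made for this example. With arrays of about 10 ints, the values would have to fit in 6 bits each, which may well not be true for the data in the question.

```java
import java.util.HashMap;
import java.util.Map;

public class PackedCounter {

    /**
     * Packs a row into one long, using bitsPerValue bits per element.
     * Assumes 0 <= value < 2^bitsPerValue and row.length * bitsPerValue <= 64.
     */
    public static long pack(int[] row, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 32
                || (long) row.length * bitsPerValue > 64) {
            throw new IllegalArgumentException("row does not fit in 64 bits");
        }
        long limit = 1L << bitsPerValue;
        long packed = 0L;
        for (int value : row) {
            if (value < 0 || value >= limit) {
                throw new IllegalArgumentException("value out of range: " + value);
            }
            packed = (packed << bitsPerValue) | value;
        }
        return packed;
    }

    /** Counts occurrences of each packed row; all rows must have the same length. */
    public static Map<Long, Long> countPacked(int[][] rows, int bitsPerValue) {
        Map<Long, Long> counts = new HashMap<>();
        for (int[] row : rows) {
            counts.merge(pack(row, bitsPerValue), 1L, Long::sum);
        }
        return counts;
    }
}
```

Because the length is fixed and every value stays within its bit range, two rows pack to the same long only if they are identical, so the counts are exact rather than probabilistic. If the values do not fit, this approach does not apply as written.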

answered Oct 3, 2018 at 16:52

OmG

