0

I have a list of objects (e.g. List and I have to find duplicate in the list. I don't have the source code of UnknownSrcClass and UnknownSrcClass doesn't have hash code and equals implemented. So I can't put it in a Set to find the duplicates

I have below two solution

  1. Build a HashMap<String,List<UnknownSrcClass>> where key will be build using the fields responsible for equality check.

Iterate the HashMap if for a key list size > 1 then iterate the list of items and find duplicates

  1. Put items in a TressSet with Comprator and check for add method's return value.

Please suggest me which one would more efficient performance wise. Per me #3 is better approach.

3
  • The answer to any "which is the most efficient"-type questions is: what do you mean by "efficient"? If you mean which has the smallest asymptotic complexity, well, just look up the asymptotic complexity for the two data structures; if you mean which takes less time for your application, you need to implement both and time them. Commented Sep 22, 2016 at 21:09
  • 3
    Different approach - how about an UnknownSrcClassWrapper which implements hashCode() and equals() on the relevant fields of UnknownSrcClass? Commented Sep 22, 2016 at 21:24
  • 1
    So if you don't know what UnknownSrcClass is, how are you supposed to tell whether two instances are duplicates? Commented Sep 22, 2016 at 21:50

2 Answers 2

1

The first approach may not be be more efficient than the second approach. Let's assume the equality check is based on 2 strings that you concatenate. There are different possibilities to get to the same string.

In the worst case you get a HashMap with a single key but n different elements in the value. Pairwise comparing them leads to running time of O(n²). This is worse than the O(n * log(n)) running time you achieve using a TreeSet.

If you use something like the first approach, create something does not lead to different values being mapped to the same key, e.g. combine the values using Arrays.asList:

HashSet<List<Object>> set = new HashSet<>();
for (Iterator<UnknownSrcClass> iterator = list.iterator(); iterator.hasNext();) {
    UnknownSrcClass element = iterator.next();
    List<Object> lst = (element == null ? null : Arrays.asList(element.getProperty1(), element.getProperty2(), ...));
    if (!set.add(lst)) {
        // handle duplicate, e.g.
        iterator.remove();
    }
}
Sign up to request clarification or add additional context in comments.

4 Comments

Thanks for the answer. What I did was build String (String concatenation) using all properties that was used in equality check. Use that key in HashMap. That way I believe it will distribute the data to multiple keys. Let me know your thought.
@Debopam A concatenation the same string can be created by different properties, e.g. "abdefg" = "" + "abcdefg" = "a"+"bcdefg" = "ab"+"cdefg" = "abc"+ "defg" = "abcde"+"fg" = "abcdef" +"g" = "abcdefg"+"" For the worst case you always get the same string from your properties, but the strings are different and therefore the O(n²) for comparing all of them. The list returned by Arrays.asList implements hashCode/equals based on the hashcodes/equality of the elements preventing this kind of behavior...
Thanks. Is this approach (code snippet you have given) more efficient than TreeSet with comparator?
@Debopam: In the expected case, yes. However if you are unlucky with the hash codes, all the elements could be mapped to the same bucket. In that case the performance would still be worse than the performance of a TreeSet.
1

I think the #1 is Ok, cos I think the cost of #1 would be O(n) but #3 would be > O(n) as long as compare would be called for each entry through the entire list. this is my #1 option:

public class Main {

static class Model {
public final Long id;
public final String field1;
public final boolean fieldn;

public Model(Long id, String field1, boolean fieldn) {
    super();
    this.id = id;
    this.field1 = field1;
    this.fieldn = fieldn;
}

}

public static void main(String[] args) throws InterruptedException {

List<Model> list = Arrays.asList(new Model(1L, "sample 1", true), new Model(1L, "sample 1", true));
Map<String, List<Model>> doublications = new HashMap<>();
list.forEach(m -> checkDoublication(doublications, m));
doublications.forEach(Main::print);
// and this would print => key: "1sample 1true", doublications: 1

}

private static void print(String key, List<Model> list) {
System.out.println(String.format("key: \"%s\", doublications: %d", key, list.size()));
}

private static String key(Model model) {
return model.id + model.field1 + model.fieldn;
}

private static void checkDoublication(Map<String, List<Model>> map, Model model) {
String key = key(model);
if (!map.containsKey(key))
    map.put(key, new LinkedList<>());
else
    map.get(key)
       .add(model);

}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.