
I'm new to Java streams, but I understand they can replace loops. I would like to know whether there is a way to filter a CSV file using a stream, as shown below, so that only the repeated records are included in the result, grouped by the Center field.

Initial CSV file


Final result


In addition, the same pair cannot appear in the final result in reverse order, as shown in the table below:

This shouldn't happen


Is there a way to do both the filtering and the grouping with a single stream, given that in theory two loops would be needed to perform the task?

  • I guess you mean records that are identical except for the id field, right? Otherwise every record in your example would count as different. Commented Aug 4, 2021 at 13:34
  • Are those real names and birthdates? Commented Aug 4, 2021 at 13:40
  • Closely related question/answer: stackoverflow.com/a/47226834/2513200 Commented Aug 4, 2021 at 13:44
  • This one is for numbers, but the ideas apply here as well: stackoverflow.com/a/31341963/2513200 Commented Aug 4, 2021 at 13:57
  • @Bohemian No. The data are fake !!! LOL. Commented Aug 4, 2021 at 14:22

3 Answers


What I understood from your examples is that you consider an entry a duplicate if all of its attributes except the ID have the same value. You can use anyMatch for this:

list.stream()
        .filter(x -> list.stream().anyMatch(y -> isDuplicate(x, y)))
        .collect(Collectors.toList());

So what does isDuplicate(x, y) do?

It returns a boolean. In this method you check whether all the fields except the id have the same value:

private boolean isDuplicate(CsvEntry x, CsvEntry y) {
    return !x.getId().equals(y.getId())
            && x.getName().equals(y.getName())
            && x.getMother().equals(y.getMother())
            && x.getBirth().equals(y.getBirth());
}

I've assumed all the fields are Strings; change the checks according to the actual types. This will give you the duplicate entries along with their corresponding IDs.


5 Comments

  • This is not an efficient solution.
  • I thought of using a hashset, but that will not give what the OP has asked for.
  • It can, with the right element type (one with an appropriate equals method), or a TreeSet with a custom Comparator.
  • @devReddit Hi... thank you so much for your reply. You're right: a hashset doesn't solve my problem. By the way, to group the result, do I need to take the resulting stream above and apply a grouping to the selection, or can I group in the same command?
  • @AdalbertoJoséBrasaca You can use Collectors.groupingBy(Function<? super T,? extends K> classifier) in the stream's collect() to group the results by a specific property.
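Combining the two ideas, here is a minimal runnable sketch (the CsvEntry record and the sample data are hypothetical stand-ins for the OP's class) that filters the duplicates and groups them by center in one pipeline:

```java
import java.util.*;
import java.util.stream.*;

public class GroupDuplicates {
    // hypothetical stand-in for the OP's CSV entry class
    record CsvEntry(int id, String name, String mother, String birth, int center) {}

    // same rule as the answer: equal on every field except id
    static boolean isDuplicate(CsvEntry x, CsvEntry y) {
        return x.id() != y.id()
                && x.name().equals(y.name())
                && x.mother().equals(y.mother())
                && x.birth().equals(y.birth());
    }

    // keep only the duplicated entries, grouped by center, in one pipeline
    static Map<Integer, List<CsvEntry>> duplicatesByCenter(List<CsvEntry> list) {
        return list.stream()
                .filter(x -> list.stream().anyMatch(y -> isDuplicate(x, y)))
                .collect(Collectors.groupingBy(CsvEntry::center));
    }

    public static void main(String[] args) {
        List<CsvEntry> list = List.of(
                new CsvEntry(1, "Ana", "Maria", "2001-01-01", 1),
                new CsvEntry(2, "Bia", "Clara", "2002-02-02", 2),
                new CsvEntry(3, "Ana", "Maria", "2001-01-01", 1));
        // entries 1 and 3 are duplicates; only center 1 appears in the map
        System.out.println(duplicatesByCenter(list).keySet());
    }
}
```

Note that the inner anyMatch still makes this O(n²) over the whole list, as the first comment below points out; the groupingBy only adds the grouping, not efficiency.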
record Person(
        int id, String name, String mother, LocalDate birth, int center) { }

List<Person> records = List.of(
        new Person(1,  "Antonio Carlos da Silva",  "Ana da Silva",              LocalDate.of(2008, 3, 31),  1),
        new Person(2,  "Carlos Roberto de Souza",  "Amália Maria de Souza",     LocalDate.of(2004, 12, 10), 1),
        new Person(3,  "Pedro de Albuquerque",     "Maria de Albuquerque",      LocalDate.of(2006, 4, 3),   2),
        new Person(4,  "Danilo da Silva Cardoso",  "Sônia de Paula Cardoso",    LocalDate.of(2002, 8, 10),  3),
        new Person(5,  "Ralfo dos Santos Filho",   "Helena dos Santos",         LocalDate.of(2012, 2, 21),  4),
        new Person(6,  "Pedro de Albuquerque",     "Maria de Albuquerque",      LocalDate.of(2006, 4, 3),   2),
        new Person(7,  "Antonio Carlos da Silva",  "Ana da Silva",              LocalDate.of(2008, 3, 31),  1),
        new Person(8,  "Paula Cristina de Abreu",  "Cristina Pereira de Abreu", LocalDate.of(2014, 10, 25), 2),
        new Person(9,  "Rosana Pereira de Campos", "Ivana Maria de Campos",     LocalDate.of(2002, 7, 16),  3),
        new Person(10, "Pedro de Albuquerque",     "Maria de Albuquerque",      LocalDate.of(2006, 4, 3),   2)
);

record PersonKey(String name, String mother, LocalDate birth, int center) {
    PersonKey(Person p) {
        this(p.name(), p.mother(), p.birth(), p.center());
    }
}

List<Person> result = records.stream()
        .collect(Collectors.groupingBy(PersonKey::new))
        .values()
        .stream()
        .filter(l -> l.size() > 1)
        .flatMap(Collection::stream)
        .sorted(Comparator.comparing(Person::center).thenComparing(Person::id))
        .toList();

This approach collects the list into a Map<PersonKey, List<Person>>, where PersonKey holds all the fields involved in the duplicate check. Next, only the List<Person> values of this map are kept and converted to a Stream<List<Person>>. Each list in this stream with at least two elements is kept (i.e. the duplicates). This is flattened into a Stream<Person>, then sorted by center with id as the tie-breaker. Finally, the stream is collected into a List<Person>.



You can do it in one pass as a stream with O(n) efficiency:

class PersonKey {
    // have a field for every column that is used to detect duplicates
    String center, name, mother, birthdate;
    public PersonKey(String line) {
        // implement String constructor
    }
    // implement equals and hashCode using all fields
}

List<String> lines; // the input 
Set<PersonKey> seen = ConcurrentHashMap.newKeySet(); // threadsafe
List<String> unique = lines.stream()
        .filter(p -> !seen.add(new PersonKey(p)))
        .distinct()
        .collect(Collectors.toList());

The trick here is that a HashSet has constant time operations and its add() method returns false if the value being added is already in the set, true otherwise.
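That add() contract can be checked with a tiny sketch (the key string here is an arbitrary example):

```java
import java.util.HashSet;
import java.util.Set;

public class AddReturnValue {
    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        // first insertion: the key is new, add() returns true
        System.out.println(seen.add("Ana,Maria,2001-01-01,1")); // prints true
        // second insertion of the same key: add() returns false
        System.out.println(seen.add("Ana,Maria,2001-01-01,1")); // prints false
    }
}
```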

9 Comments

  • +1, although personally, if I have to use two stateful filters in a stream, I'd prefer the loop. If I'm not mistaken, the seen set could be dropped if we could live with, e.g., a LinkedHashSet result instead of a list.
  • @hulk You are mistaken :) The problem is that the objects being collected in the set are not the objects needed in the result. You could avoid the second set by using distinct after mapping to an object whose equals method ignores its id field, which is a violation of all things good. But internally distinct uses a HashSet anyway, so this code is no heavier; you're just seeing the set. A LinkedHashSet is not needed: either way always finds the first unique item.
  • Ah, I see, I missed that. Well then, this is probably as good as it gets :)
  • You could probably use a CSV driver to run this as an SQL query.
  • @Bohemian I don't think I expressed myself correctly. What I need is a list of which records are duplicates, not just whether duplicates exist. So I think using a hashset doesn't solve the problem. By the way, thank you so much for your reply.
