
I'm new to Java streams, but I understand they can replace loops. I would like to know whether there is a way to filter a CSV file using a stream, as shown below, so that only the repeated records are included in the result, grouped by the Center field.

Initial CSV file


Final result


In addition, the same pair cannot appear in the final result in reverse order, as shown in the table below:

This shouldn't happen


Is there a way to do both the filtering and the grouping with a single stream, given that in theory two loops would be needed to perform the task?

  • I guess you mean records that are identical except for the id field, right? Otherwise every record in your example would count as different. Commented Aug 4, 2021 at 13:34
  • Are those real names and birthdates? Commented Aug 4, 2021 at 13:40
  • Closely related question/answer: stackoverflow.com/a/47226834/2513200 Commented Aug 4, 2021 at 13:44
  • This one is for numbers, but the ideas apply here as well: stackoverflow.com/a/31341963/2513200 Commented Aug 4, 2021 at 13:57
  • @Bohemian No. The data are fake !!! LOL. Commented Aug 4, 2021 at 14:22

3 Answers


What I understood from your examples is that you consider an entry a duplicate if all of its attributes except the ID have the same value. You can use anyMatch for this:

list.stream()
        .filter(x -> list.stream().anyMatch(y -> isDuplicate(x, y)))
        .collect(Collectors.toList());

So what does isDuplicate(x, y) do?

It returns a boolean. In this method you check whether all the fields except the id have the same value:

private boolean isDuplicate(CsvEntry x, CsvEntry y) {
    return !x.getId().equals(y.getId())
            && x.getName().equals(y.getName())
            && x.getMother().equals(y.getMother())
            && x.getBirth().equals(y.getBirth());
}

I've assumed all the fields are Strings; change the checks according to the actual types. This will give you the duplicate entries along with their corresponding IDs.


5 Comments

  • This is not an efficient solution.
  • I thought of using a hashset, but that will not give what the OP has asked for.
  • It can, with the right element type (one with an appropriate equals method), or a TreeSet with a custom Comparator.
  • @devReddit Hi... thank you so much for your reply. You're right: a hashset doesn't solve my problem. By the way, to group the result, do I need to take the resulting stream above and apply a grouping to the selection, or can I group in the same command?
  • @AdalbertoJoséBrasaca You can use Collectors.groupingBy(Function<? super T,? extends K> classifier) in the stream's collect() to group the results by a specific property.
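Combining the two ideas, here is a minimal runnable sketch (the CsvEntry record and the sample data are hypothetical stand-ins for the OP's class) that filters the duplicates and groups them by center in one pipeline:

```java
import java.util.*;
import java.util.stream.*;

public class GroupDuplicates {
    // hypothetical stand-in for the OP's CSV entry class
    record CsvEntry(int id, String name, String mother, String birth, int center) {}

    // same rule as the answer: equal on every field except id
    static boolean isDuplicate(CsvEntry x, CsvEntry y) {
        return x.id() != y.id()
                && x.name().equals(y.name())
                && x.mother().equals(y.mother())
                && x.birth().equals(y.birth());
    }

    // keep only the duplicated entries, grouped by center, in one pipeline
    static Map<Integer, List<CsvEntry>> duplicatesByCenter(List<CsvEntry> list) {
        return list.stream()
                .filter(x -> list.stream().anyMatch(y -> isDuplicate(x, y)))
                .collect(Collectors.groupingBy(CsvEntry::center));
    }

    public static void main(String[] args) {
        List<CsvEntry> list = List.of(
                new CsvEntry(1, "Ana", "Maria", "2001-01-01", 1),
                new CsvEntry(2, "Bia", "Clara", "2002-02-02", 2),
                new CsvEntry(3, "Ana", "Maria", "2001-01-01", 1));
        // entries 1 and 3 are duplicates; only center 1 appears in the map
        System.out.println(duplicatesByCenter(list).keySet());
    }
}
```

Note that the inner anyMatch still makes this O(n²) over the whole list, as the first comment below points out; the groupingBy only adds the grouping, not efficiency.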
record Person(
        int id, String name, String mother, LocalDate birth, int center) { }

List<Person> records = List.of(
        new Person(1,  "Antonio Carlos da Silva",  "Ana da Silva",              LocalDate.of(2008, 3, 31),  1),
        new Person(2,  "Carlos Roberto de Souza",  "Amália Maria de Souza",     LocalDate.of(2004, 12, 10), 1),
        new Person(3,  "Pedro de Albuquerque",     "Maria de Albuquerque",      LocalDate.of(2006, 4, 3),   2),
        new Person(4,  "Danilo da Silva Cardoso",  "Sônia de Paula Cardoso",    LocalDate.of(2002, 8, 10),  3),
        new Person(5,  "Ralfo dos Santos Filho",   "Helena dos Santos",         LocalDate.of(2012, 2, 21),  4),
        new Person(6,  "Pedro de Albuquerque",     "Maria de Albuquerque",      LocalDate.of(2006, 4, 3),   2),
        new Person(7,  "Antonio Carlos da Silva",  "Ana da Silva",              LocalDate.of(2008, 3, 31),  1),
        new Person(8,  "Paula Cristina de Abreu",  "Cristina Pereira de Abreu", LocalDate.of(2014, 10, 25), 2),
        new Person(9,  "Rosana Pereira de Campos", "Ivana Maria de Campos",     LocalDate.of(2002, 7, 16),  3),
        new Person(10, "Pedro de Albuquerque",     "Maria de Albuquerque",      LocalDate.of(2006, 4, 3),   2)
);

record PersonKey(String name, String mother, LocalDate birth, int center) {
    PersonKey(Person p) {
        this(p.name(), p.mother(), p.birth(), p.center());
    }
}

List<Person> result = records.stream()
        .collect(Collectors.groupingBy(PersonKey::new))
        .values()
        .stream()
        .filter(l -> l.size() > 1)
        .flatMap(Collection::stream)
        .sorted(Comparator.comparing(Person::center).thenComparing(Person::id))
        .toList();

This approach collects the list into a Map<PersonKey, List<Person>>, where PersonKey holds all the fields involved in the duplicate check. Next, only the List<Person> values of this map are kept and converted to a Stream<List<Person>>. Each list in this stream with at least two elements is kept (i.e. the duplicates). This is flattened into a Stream<Person>, then sorted by center with id as the tie-breaker. Finally, the stream is collected into a List<Person>.



You can do it in one pass as a stream with O(n) efficiency:

class PersonKey {
    // have a field for every column that is used to detect duplicates
    String center, name, mother, birthdate;
    public PersonKey(String line) {
        // implement String constructor
    }
    // implement equals and hashCode using all fields
}

List<String> lines; // the input 
Set<PersonKey> seen = ConcurrentHashMap.newKeySet(); // threadsafe
List<String> unique = lines.stream()
        .filter(p -> !seen.add(new PersonKey(p)))
        .distinct()
        .collect(Collectors.toList());

The trick here is that a HashSet has constant time operations and its add() method returns false if the value being added is already in the set, true otherwise.
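That add() contract can be checked with a tiny sketch (the key string here is an arbitrary example):

```java
import java.util.HashSet;
import java.util.Set;

public class AddReturnValue {
    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        // first insertion: the key is new, add() returns true
        System.out.println(seen.add("Ana,Maria,2001-01-01,1")); // prints true
        // second insertion of the same key: add() returns false
        System.out.println(seen.add("Ana,Maria,2001-01-01,1")); // prints false
    }
}
```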

9 Comments

  • +1, although personally, if I have to use two stateful filters in a stream, I'd prefer the loop. If I'm not mistaken, the seen set could be dropped if we could live with, e.g., a LinkedHashSet result instead of a list.
  • @hulk You are mistaken :) The problem is that the objects being collected in the set are not the objects needed in the result. You could avoid the second set by using distinct after mapping to an object whose equals method ignores its id field, which is a violation of all things good. But internally distinct uses a HashSet anyway, so this code is no heavier; you're just seeing the set. A LinkedHashSet is not needed: either way always finds the first unique item.
  • Ah, I see, I missed that. Well then, this is probably as good as it gets :)
  • You could probably use a CSV driver to run this as an SQL query.
  • @Bohemian I don't think I expressed myself correctly. What I need is a list of which records are duplicates, not just whether duplicates exist. So I think using a hashset doesn't solve the problem. By the way, thank you so much for your reply.
