Entity Framework - remove duplicates

Question

I want to remove duplicate records using Entity Framework.

This is what I've tried:

var result = _context.History
            .GroupBy(s => new
                    {
                        s.Date,
                        s.EventId
                    })
            .SelectMany(grp => grp.Skip(1)).ToList();

_context.History.RemoveRange(result);
await _context.SaveChangesAsync();

But I get an error:

System.InvalidOperationException: Processing of the LINQ expression 'grp => grp.Skip(1)' by 'NavigationExpandingExpressionVisitor' failed. This may indicate either a bug or a limitation in EF Core

I understand that this is a breaking change in Entity Framework, but I really don't know how to update my code.

Don't use EF Core in the first place. EF Core is an ORM, not a SQL replacement. There are no objects here, and the easiest and most efficient way to remove duplicates involves a CTE with ROW_NUMBER() that returns all the multiples, ranked by whatever sort order you want, allowing you to select which row to keep.
– Panagiotis Kanavos, Commented Nov 22, 2020 at 17:53
E.g. with dups as (select *, row_number() over (partition by date,eventid order by id desc) rn from...) delete dups where rn>1 will delete all duplicates except the row with the largest id. The CTE doesn't need to return all columns; the key columns are enough. You can specify a different ORDER BY to choose which rows to preserve.
– Panagiotis Kanavos, Commented Nov 22, 2020 at 17:56
@PanagiotisKanavos Is this CTE database agnostic or SQL Server specific? An ORM might not be a SQL replacement, but LINQ is supposed to be a database-agnostic abstraction, a language-integrated query language, so why not use it? The fact that EF Core breaks the contract by refusing to translate it doesn't mean the OP is doing something wrong.
– Ivan Stoev, Commented Nov 22, 2020 at 18:02
@IvanStoev the operation isn't object-agnostic. There are no objects involved. LINQ wasn't meant to handle such situations. ORMs were never meant for reporting queries or for fully replacing SQL. If you replace "database agnostic" with "ANSI standard", then yes, it's ANSI standard, and it's even supported in MySQL since MySQL 8. All other major databases already had ROW_NUMBER().
– Panagiotis Kanavos, Commented Nov 22, 2020 at 18:04
@IvanStoev and SQLite added window functions in version 3.25. Besides, what the OP is trying to do doesn't make sense in SQL: that group isn't really grouping and there's no SKIP in SQL. This is trying to apply a (somewhat inefficient) LINQ-to-Objects operation to a database, hoping that EF Core can somehow translate it to SQL.
– Panagiotis Kanavos, Commented Nov 22, 2020 at 18:07
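
For readers who want to try the CTE suggested in the comments above, here is a minimal sketch of the statement held as a C# constant. It rests on several assumptions that the thread does not confirm: the table is named History, it has an integer key column named Id, and the database is SQL Server (deleting through a CTE like this is T-SQL syntax; other databases would likely need a different DELETE form). The class name DedupSql is made up for illustration.

```csharp
using System;

public static class DedupSql
{
    // Assumptions: table "History", key column "Id", SQL Server (T-SQL) syntax.
    // Within each (Date, EventId) partition the row with the highest Id gets rn = 1
    // and is kept; every other row in the partition is deleted.
    public const string DeleteAllButHighestId = @"
WITH dups AS (
    SELECT Id,
           ROW_NUMBER() OVER (PARTITION BY [Date], EventId ORDER BY Id DESC) AS rn
    FROM History
)
DELETE FROM dups WHERE rn > 1;";
}

public static class Program
{
    public static void Main()
    {
        // With an EF Core 3.0+ context, the statement could be sent with:
        //     await _context.Database.ExecuteSqlRawAsync(DedupSql.DeleteAllButHighestId);
        // Here we only print it, so the sketch runs without a database.
        Console.WriteLine(DedupSql.DeleteAllButHighestId);
    }
}
```

Changing the ORDER BY inside the OVER clause changes which row of each partition survives, as the comment above points out.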

ochzhen · Accepted Answer · 2020-11-23 17:46:14Z

1

It looks like Entity Framework doesn't know how to translate the Skip part of this LINQ query. Moreover, it cannot translate the GroupBy part either. EF Core 3 throws an exception to let us know :)

So, a dirty but simple way is to add AsEnumerable almost at the beginning; however, this fetches the whole table and performs the operations in memory:

var result = _context.History
            .AsEnumerable()
            .GroupBy(s => new { s.Date, s.EventId })
            .SelectMany(g => g.Skip(1))
            .ToList();

_context.History.RemoveRange(result);
await _context.SaveChangesAsync();
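
To see which rows the in-memory GroupBy and Skip(1) combination actually selects, here is a small self-contained sketch that runs on a plain list, with no database involved. HistoryRow and DuplicateFinder are hypothetical stand-ins, and the Id property is an assumption about the entity. The sketch also adds an explicit OrderBy before Skip(1), which the snippet above does not have, so that the kept row is chosen deliberately (here: the lowest Id).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the History entity; only the properties used here.
public class HistoryRow
{
    public int Id { get; set; }
    public DateTime Date { get; set; }
    public int EventId { get; set; }
}

public static class DuplicateFinder
{
    // Returns every row except the lowest-Id row of each (Date, EventId) group.
    public static List<HistoryRow> RowsToRemove(IEnumerable<HistoryRow> rows)
    {
        return rows
            .GroupBy(r => new { r.Date, r.EventId })
            .SelectMany(g => g.OrderBy(r => r.Id).Skip(1))
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        var day = new DateTime(2020, 11, 22);
        var rows = new List<HistoryRow>
        {
            new HistoryRow { Id = 1, Date = day, EventId = 10 },
            new HistoryRow { Id = 2, Date = day, EventId = 10 }, // duplicate of Id 1
            new HistoryRow { Id = 3, Date = day, EventId = 11 }, // unique
            new HistoryRow { Id = 4, Date = day, EventId = 10 }, // duplicate of Id 1
        };

        foreach (var row in DuplicateFinder.RowsToRemove(rows))
        {
            Console.WriteLine(row.Id); // prints 2, then 4
        }
    }
}
```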

Since in most cases it's not acceptable to fetch everything, we can split the first query into two so that we download only the duplicated records.

The second answer to this question (stackoverflow.com/questions/58138556/…) might help; we can try something like this:

var keys = _context.History
                .GroupBy(s => new { s.Date, s.EventId })
                .Select(g => new { g.Key, Count = g.Count() })
                .Where(t => t.Count > 1)
                .Select(t => new { t.Key.Date, t.Key.EventId })
                .ToList();

var result = _context.History
    .Where(h => keys.Any(k => k.Date == h.Date && k.EventId == h.EventId))
    .AsEnumerable()
    .GroupBy(s => new { s.Date, s.EventId })
    .SelectMany(g => g.Skip(1))
    .ToList();

_context.History.RemoveRange(result);
await _context.SaveChangesAsync();
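
One caveat with the two-step query above: a keys.Any(...) over a local list of anonymous objects may not be translatable to SQL, depending on the EF Core version. A hedged variation is sketched below. It narrows the second query with Contains over a list of plain ints, which EF Core can generally turn into an IN clause, and applies the exact (Date, EventId) match in memory. The method takes an IQueryable so it can be run here against an in-memory list; HistoryRow, TwoStepDuplicateFinder, and the Id property are assumptions for illustration, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the History entity; only the properties used here.
public class HistoryRow
{
    public int Id { get; set; }
    public DateTime Date { get; set; }
    public int EventId { get; set; }
}

public static class TwoStepDuplicateFinder
{
    public static List<HistoryRow> RowsToRemove(IQueryable<HistoryRow> history)
    {
        // Step 1: aggregate-only query (GROUP BY ... HAVING COUNT(*) > 1 in SQL terms).
        var keys = history
            .GroupBy(h => new { h.Date, h.EventId })
            .Where(g => g.Count() > 1)
            .Select(g => new { g.Key.Date, g.Key.EventId })
            .ToList();

        // Step 2: pre-filter on a single primitive column, then finish in memory.
        var eventIds = keys.Select(k => k.EventId).Distinct().ToList();
        var keySet = keys.Select(k => (k.Date, k.EventId)).ToHashSet();

        return history
            .Where(h => eventIds.Contains(h.EventId))          // may over-fetch some rows
            .AsEnumerable()                                     // everything below runs in memory
            .Where(h => keySet.Contains((h.Date, h.EventId)))  // exact key match
            .GroupBy(h => new { h.Date, h.EventId })
            .SelectMany(g => g.OrderBy(h => h.Id).Skip(1))     // keep the lowest Id per group
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        var d1 = new DateTime(2020, 11, 22);
        var d2 = new DateTime(2020, 11, 23);
        var rows = new List<HistoryRow>
        {
            new HistoryRow { Id = 1, Date = d1, EventId = 10 },
            new HistoryRow { Id = 2, Date = d1, EventId = 10 }, // duplicate of Id 1
            new HistoryRow { Id = 3, Date = d1, EventId = 11 }, // unique
            new HistoryRow { Id = 4, Date = d2, EventId = 10 }, // same event, other date
            new HistoryRow { Id = 5, Date = d1, EventId = 10 }, // duplicate of Id 1
        };

        foreach (var row in TwoStepDuplicateFinder.RowsToRemove(rows.AsQueryable()))
        {
            Console.WriteLine(row.Id); // prints 2, then 5
        }
    }
}
```

The pre-filter trades some over-fetching (row 4 in the sample data) for a query shape that is more likely to translate; whether that trade is acceptable depends on how selective EventId is in the real table.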

edited Nov 23, 2020 at 17:46

answered Nov 22, 2020 at 21:55

ochzhen


3 Comments

Gerald Hughes Over a year ago

Hello, welcome to SO. This is the error I get: System.InvalidOperationException: Client side GroupBy is not supported.

ochzhen Over a year ago

Interesting, I didn't know it but looks like in EF Core 3 they added an explicit error since GroupBy is not being translated to SQL, second answer here is quite good: stackoverflow.com/questions/58138556/… The easiest solution is to move AsEnumerable() to the top right after _context.History. However, it will fetch all data from this table to the server and perform everything in memory. Is it acceptable in your case?

ochzhen Over a year ago

I've updated the answer so that it's easier for you to understand my previous comment. It might help :)

Milos Gak · Accepted Answer · 2020-11-24 14:40:39Z

1

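In this case you are grouping by both columns. Note that this query returns every row of each duplicated group, including the copy you may want to keep: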

var duplicate = DB.History.GroupBy(x => new { x.Date, x.EventId})
                         .Where(x => x.Count() > 1)
                         .SelectMany(x => x.ToList());

answered Nov 24, 2020 at 14:40

Milos Gak
