
I have a local PostgreSQL database that was created by a Rails application. It has 600k records, of which ~200k are duplicates. I want to keep only one of each record and delete the duplicates. I write SQL every day for work, but Rails is my hobby, and I still struggle with ActiveRecord.

Here's how I found the duplicates (in Rails console):

Summary.select(:map_id).group(:map_id).having("count(*) > 1")

I don't think I can simply append destroy_all to that statement, as it would destroy all instances of each duplicated entry, including the one copy I want to keep.

Could you please tell me how to update this so that it removes the duplicates?

3 Comments

  • If you know your way around SQL why don't you just do it in SQL? Commented May 30, 2016 at 0:39
  • For some reason I thought using pure SQL in Rails was hard. I've done this a few times in SQL: one way is to order by map_id and keep the first row; another is to order by, create a count column, and select where that count column = some_number (useful in case you don't want the first, but rather the 2nd or 3rd observation to be preserved). Commented May 30, 2016 at 0:44
  • Using raw SQL is easy in Rails; I do it all the time because ActiveRecord only understands baby-talk SQL (see the sketch after these comments). Commented May 30, 2016 at 1:21
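
As the comments note, this can be done entirely in raw SQL from Rails. A rough sketch of the self-join delete idiom in PostgreSQL (table and column names are taken from the question; this keeps the lowest id per map_id):

# Deletes every row that shares a map_id with an older (lower-id) row,
# leaving exactly one row per map_id, all in a single SQL statement.
ActiveRecord::Base.connection.execute(<<~SQL)
  DELETE FROM summaries a
  USING summaries b
  WHERE a.map_id = b.map_id
    AND a.id > b.id
SQL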

3 Answers


This will destroy the duplicates in waves, selecting only a single duplicate per map_id on each pass. The loop finishes automatically when no duplicates remain.

loop do
  # Each pass selects one duplicate per map_id: the copy with the highest id.
  duplicates = Summary.select("MAX(id) as id, map_id").group(:map_id).having("count(*) > 1")
  break if duplicates.length == 0
  # destroy_all instantiates each selected record and destroys it by that id.
  duplicates.destroy_all
end

If the database looks like this:

| id | map_id |
|  1 |    235 |
|  2 |    299 |
|  3 |    324 |
|  4 |    235 |
|  5 |    235 |
|  6 |    299 |
|  7 |    235 |
|  8 |    324 |
|  9 |    299 |

In the first wave, these records would be returned and destroyed:

| id | map_id |
|  7 |    235 |
|  8 |    324 |
|  9 |    299 |

In the second wave, these records would be returned and destroyed:

| id | map_id |
|  5 |    235 |
|  6 |    299 |

The third wave would return and destroy this record:

| id | map_id |
|  4 |    235 |

The fourth wave would find no duplicates and complete the process. Each wave removes one copy per duplicated map_id, so the number of passes equals the largest number of copies of any single map_id, plus one final empty pass. Unless some map_id has a huge number of duplicates, the loop should finish in single-digit iterations.

With this approach, only duplicates are ever returned, and only the newer copies are removed. To remove the older copies instead, change the query to:

  duplicates = Summary.select("MIN(id) as id, map_id").group(:map_id).having("count(*) > 1")

In that case, wave 1 would return and destroy:

| id | map_id |
|  1 |    235 |
|  2 |    299 |
|  3 |    324 |

Wave 2 would return and destroy:

| id | map_id |
|  4 |    235 |
|  6 |    299 |

Wave 3 would return and destroy:

| id | map_id |
|  5 |    235 |

Wave 4 would complete the process.


2 Comments

  • Takes a little while to finish for 200k duplicates, but it works. I'm working on my scraping logic to reduce the number of duplicates generated.
  • That's great to hear! When you use it on subsequent runs, it should be much faster. 200K is a lot of records to destroy.

I would go to the db console (rails dbconsole) and do:

CREATE TABLE some_temp_name AS
  SELECT DISTINCT ON (map_id) * FROM summaries ORDER BY map_id, id;

Then rename the tables.
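
If you'd rather script the whole swap from Rails than type it in the console, a sketch along the same lines (summaries_dedup and summaries_old are illustrative names; the ORDER BY makes DISTINCT ON keep the lowest id per map_id):

conn = ActiveRecord::Base.connection
# Build a deduplicated copy: one row per map_id, keeping the lowest id.
conn.execute(<<~SQL)
  CREATE TABLE summaries_dedup AS
    SELECT DISTINCT ON (map_id) * FROM summaries ORDER BY map_id, id
SQL
# Swap the tables; keep the original around until the copy is verified.
conn.execute("ALTER TABLE summaries RENAME TO summaries_old")
conn.execute("ALTER TABLE summaries_dedup RENAME TO summaries")

Note that CREATE TABLE ... AS copies only the data: indexes, constraints, and the id sequence default are not carried over, so they would need to be recreated on the new table.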

EDIT - This seems like what you're looking for:

Summary.where.not(id: Summary.group(:map_id).pluck('min(summaries.id)')).delete_all

NOT TESTED. It was part of this answer here: Rails: Delete duplicate records based on multiple columns
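
One caveat with that query: pluck runs immediately and pulls every MIN(id) back into Ruby, so with hundreds of thousands of distinct map_ids the NOT IN list gets very large. Using select instead keeps everything in a single SQL statement with a subquery (an untested sketch with the same semantics):

# Same idea, but the MIN(id) set stays in the database as a subquery.
Summary.where.not(id: Summary.group(:map_id).select("MIN(summaries.id)")).delete_all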

5 Comments

  • Would definitely work, just wanted something less hackish since it'll be part of the web scraping code and thus be executed more than once.
  • @nonegiven72: Why would you be doing this more than once? Presumably you'd clean up the mess of duplicates, add a UNIQUE constraint to prevent them from happening again, and then check for duplicates (and catch the unique violation exception from the constraint) before adding/updating (see the migration sketch after these comments).
  • My other applications have had this, but the unique check was only when someone created an account, and honestly that wasn't too often. I worried that when scraping 500k records in an hour, checking uniqueness for each one might slow down the process, and that it'd be easier just to remove them at the end.
  • @nonegiven72: If you're worried about performance then you want to bypass the ORM as much as possible anyway and pump data straight into the database, then throw a little SQL at the database to clean things up en masse. Trying to do everything through Rails is politically correct and will make the fan boys smile, but it is often silly and wasteful. IMO of course.
  • @nonegiven72: Please see the edit; I think this is what you were looking for. You can do to_sql instead of delete_all to see what gets generated.
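
Since the comments suggest adding a UNIQUE constraint once the data is clean, here is a minimal migration sketch (the class name is illustrative, the cleanup must run first or the index creation will fail on the existing duplicates, and on Rails 5+ the superclass takes a version, e.g. ActiveRecord::Migration[5.0]):

class AddUniqueIndexOnSummariesMapId < ActiveRecord::Migration
  def change
    # PostgreSQL enforces uniqueness at the database level from here on;
    # inserting a duplicate map_id raises ActiveRecord::RecordNotUnique.
    add_index :summaries, :map_id, unique: true
  end
end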

What I would suggest is to fetch all the records, ordered by the field that has duplicates.

Then loop over them and keep just one record per value.

value = nil # tracks the last map_id seen; starts at nil so the first record is always kept
Summary.order("map_id ASC").each do |record|
  if record.map_id == value
    # same map_id as the record we just kept: a duplicate
    record.destroy
  else
    # first record for this map_id: keep it and remember the value
    value = record.map_id
  end
end
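
One caveat at this scale: calling each on the full relation instantiates all 600k ActiveRecord objects at once, and find_each can't be used here because it ignores a custom order. A lighter sketch with the same keep-the-first behavior, plucking only the two columns and deleting in batches (the batch size is arbitrary, and delete_all skips callbacks):

seen = {}
dup_ids = []
# Pluck just (id, map_id) pairs instead of instantiating full records.
Summary.order(:map_id, :id).pluck(:id, :map_id).each do |id, map_id|
  if seen[map_id]
    dup_ids << id       # every row after the first per map_id is a duplicate
  else
    seen[map_id] = true # lowest-id row for this map_id: keep it
  end
end
dup_ids.each_slice(10_000) { |batch| Summary.where(id: batch).delete_all }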

2 Comments

  • I get the explanation, but in the code I don't understand the value = nil portion.
  • You need to initialize the variable to nil for the first iteration of the loop.
