
I have a local PostgreSQL database that was created by a Rails application. It has 600k records, of which ~200k are duplicates. I want to keep only one of each record and delete the duplicates. I write SQL every day for work, but Rails is my hobby, and I still struggle with ActiveRecord.

Here's how I found the duplicates (in Rails console):

Summary.select(:map_id).group(:map_id).having("count(*) > 1")

I don't think I can simply append destroy_all to that statement, as it would destroy all instances of each duplicated entry, including the one copy I want to keep.

Could you please tell me how to update this so that it removes the duplicates?

3 Comments

  • If you know your way around SQL why don't you just do it in SQL? Commented May 30, 2016 at 0:39
  • For some reason I thought using pure SQL in Rails was hard. I've done this a few times in SQL: one way is to order by map_id and keep the first row; another is to order by, create a count column, and select where that count column = some_number (useful in case you don't want the first, but rather the 2nd or 3rd observation to be preserved). Commented May 30, 2016 at 0:44
  • Using raw SQL is easy in Rails; I do it all the time because ActiveRecord only understands baby-talk SQL (see the sketch after these comments). Commented May 30, 2016 at 1:21
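
As the comments note, this can be done entirely in raw SQL from Rails. A rough sketch of the self-join delete idiom in PostgreSQL (table and column names are taken from the question; this keeps the lowest id per map_id):

# Deletes every row that shares a map_id with an older (lower-id) row,
# leaving exactly one row per map_id, all in a single SQL statement.
ActiveRecord::Base.connection.execute(<<~SQL)
  DELETE FROM summaries a
  USING summaries b
  WHERE a.map_id = b.map_id
    AND a.id > b.id
SQL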

3 Answers


This will destroy the duplicates in waves, selecting only a single duplicate per map_id on each pass. The loop finishes automatically when no duplicates remain.

loop do
  # Each pass selects one duplicate per map_id: the copy with the highest id.
  duplicates = Summary.select("MAX(id) as id, map_id").group(:map_id).having("count(*) > 1")
  break if duplicates.length == 0
  # destroy_all instantiates each selected record and destroys it by that id.
  duplicates.destroy_all
end

If the database looks like this:

| id | map_id |
|  1 |    235 |
|  2 |    299 |
|  3 |    324 |
|  4 |    235 |
|  5 |    235 |
|  6 |    299 |
|  7 |    235 |
|  8 |    324 |
|  9 |    299 |

In the first wave, these records would be returned and destroyed:

| id | map_id |
|  7 |    235 |
|  8 |    324 |
|  9 |    299 |

In the second wave, these records would be returned and destroyed:

| id | map_id |
|  5 |    235 |
|  6 |    299 |

The third wave would return and destroy this record:

| id | map_id |
|  4 |    235 |

The fourth wave would find no duplicates and complete the process. Each wave removes one copy per duplicated map_id, so the number of passes equals the largest number of copies of any single map_id, plus one final empty pass. Unless some map_id has a huge number of duplicates, the loop should finish in single-digit iterations.

With this approach, only duplicates are ever returned, and only the newer copies are removed. To remove the older copies instead, change the query to:

  duplicates = Summary.select("MIN(id) as id, map_id").group(:map_id).having("count(*) > 1")

In that case, wave 1 would return and destroy:

| id | map_id |
|  1 |    235 |
|  2 |    299 |
|  3 |    324 |

Wave 2 would return and destroy:

| id | map_id |
|  4 |    235 |
|  6 |    299 |

Wave 3 would return and destroy:

| id | map_id |
|  5 |    235 |

Wave 4 would complete the process.


2 Comments

  • Takes a little while to finish for 200k duplicates, but it works. I'm working on my scraping logic to reduce the number of duplicates generated.
  • That's great to hear! When you use it on subsequent runs, it should be much faster. 200K is a lot of records to destroy.

I would go to the db console (rails dbconsole) and do:

CREATE TABLE some_temp_name AS
  SELECT DISTINCT ON (map_id) * FROM summaries ORDER BY map_id, id;

Then rename the tables.
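
If you'd rather script the whole swap from Rails than type it in the console, a sketch along the same lines (summaries_dedup and summaries_old are illustrative names; the ORDER BY makes DISTINCT ON keep the lowest id per map_id):

conn = ActiveRecord::Base.connection
# Build a deduplicated copy: one row per map_id, keeping the lowest id.
conn.execute(<<~SQL)
  CREATE TABLE summaries_dedup AS
    SELECT DISTINCT ON (map_id) * FROM summaries ORDER BY map_id, id
SQL
# Swap the tables; keep the original around until the copy is verified.
conn.execute("ALTER TABLE summaries RENAME TO summaries_old")
conn.execute("ALTER TABLE summaries_dedup RENAME TO summaries")

Note that CREATE TABLE ... AS copies only the data: indexes, constraints, and the id sequence default are not carried over, so they would need to be recreated on the new table.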

EDIT - This seems like what you're looking for:

Summary.where.not(id: Summary.group(:map_id).pluck('min(summaries.id)')).delete_all

NOT TESTED. It was part of this answer here: Rails: Delete duplicate records based on multiple columns
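
One caveat with that query: pluck runs immediately and pulls every MIN(id) back into Ruby, so with hundreds of thousands of distinct map_ids the NOT IN list gets very large. Using select instead keeps everything in a single SQL statement with a subquery (an untested sketch with the same semantics):

# Same idea, but the MIN(id) set stays in the database as a subquery.
Summary.where.not(id: Summary.group(:map_id).select("MIN(summaries.id)")).delete_all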

5 Comments

  • Would definitely work, just wanted something less hackish since it'll be part of the web scraping code and thus be executed more than once.
  • @nonegiven72: Why would you be doing this more than once? Presumably you'd clean up the mess of duplicates, add a UNIQUE constraint to prevent them from happening again, and then check for duplicates (and catch the unique violation exception from the constraint) before adding/updating (see the migration sketch after these comments).
  • My other applications have had this, but the unique check was only when someone created an account, and honestly that wasn't too often. I worried that when scraping 500k records in an hour, checking uniqueness for each one might slow down the process, and that it'd be easier just to remove them at the end.
  • @nonegiven72: If you're worried about performance then you want to bypass the ORM as much as possible anyway and pump data straight into the database, then throw a little SQL at the database to clean things up en masse. Trying to do everything through Rails is politically correct and will make the fan boys smile, but it is often silly and wasteful. IMO of course.
  • @nonegiven72: Please see the edit; I think this is what you were looking for. You can do to_sql instead of delete_all to see what gets generated.
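
Since the comments suggest adding a UNIQUE constraint once the data is clean, here is a minimal migration sketch (the class name is illustrative, the cleanup must run first or the index creation will fail on the existing duplicates, and on Rails 5+ the superclass takes a version, e.g. ActiveRecord::Migration[5.0]):

class AddUniqueIndexOnSummariesMapId < ActiveRecord::Migration
  def change
    # PostgreSQL enforces uniqueness at the database level from here on;
    # inserting a duplicate map_id raises ActiveRecord::RecordNotUnique.
    add_index :summaries, :map_id, unique: true
  end
end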

What I would suggest is to fetch all the records, ordered by the field that has duplicates.

Then loop over them and keep just one record per value.

value = nil # tracks the last map_id seen; starts at nil so the first record is always kept
Summary.order("map_id ASC").each do |record|
  if record.map_id == value
    # same map_id as the record we just kept: a duplicate
    record.destroy
  else
    # first record for this map_id: keep it and remember the value
    value = record.map_id
  end
end
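
One caveat at this scale: calling each on the full relation instantiates all 600k ActiveRecord objects at once, and find_each can't be used here because it ignores a custom order. A lighter sketch with the same keep-the-first behavior, plucking only the two columns and deleting in batches (the batch size is arbitrary, and delete_all skips callbacks):

seen = {}
dup_ids = []
# Pluck just (id, map_id) pairs instead of instantiating full records.
Summary.order(:map_id, :id).pluck(:id, :map_id).each do |id, map_id|
  if seen[map_id]
    dup_ids << id       # every row after the first per map_id is a duplicate
  else
    seen[map_id] = true # lowest-id row for this map_id: keep it
  end
end
dup_ids.each_slice(10_000) { |batch| Summary.where(id: batch).delete_all }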

2 Comments

  • I get the explanation, but in the code I don't understand the value = nil portion.
  • You need to initialize the variable to nil for the first iteration of the loop.
