I want to remove duplicates from a large table that has about 1 million rows and grows every hour. It has no unique ID and about 575 columns, most of them sparsely filled.

The table is like a log table: new entries are appended every hour, and there is no unique timestamp.

Duplicates make up about 1-3% of the rows, but I want to remove them anyway ;) Any ideas?

I tried using the ctid column (as here), but it's very slow.

1 Answer

A basic idea that generally works well with PostgreSQL is to create an index on a hash of the whole row, that is, of all columns taken together.

Example:

CREATE INDEX index_name ON tablename (md5((tablename.*)::text));

This will work unless there are columns that don't play well with the requirement of immutability (mostly timestamp with time zone because their cast-to-text value is session-dependent).

Once this index is created, duplicates can be found quickly by self-joining with the hash, with a query looking like this:

SELECT t1.ctid, t2.ctid
FROM tablename t1 JOIN tablename t2
 ON (md5((t1.*)::text) = md5((t2.*)::text))
WHERE t1.ctid > t2.ctid;
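
To actually remove the duplicates rather than just list them, the same join condition can be moved into a DELETE. The following is only a sketch under the same assumptions as the query above (the placeholder table name tablename, and a row-to-text cast that is usable in the index). The extra text comparison is an addition of this sketch, not part of the original query: it guards against the unlikely case of two different rows sharing an MD5 hash.

```sql
-- Sketch: delete every row for which an identical row with a smaller ctid exists,
-- so that exactly one copy of each duplicated row (the one with the lowest ctid) survives.
BEGIN;

DELETE FROM tablename t1
USING tablename t2
WHERE md5((t1.*)::text) = md5((t2.*)::text)   -- same expression as in the index
  AND (t1.*)::text = (t2.*)::text             -- guard against hash collisions
  AND t1.ctid > t2.ctid;                      -- keep the row with the lowest ctid

-- Inspect the reported row count before making the change permanent;
-- use ROLLBACK instead if the number looks wrong.
COMMIT;
```

Running it inside a transaction makes it possible to check the number of deleted rows first. Since ctid values change when rows are updated or the table is rewritten, the statement is best run while nothing else is writing to the table.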

You may also use this index to avoid duplicate rows in the future, rather than periodically de-duplicating, by making it UNIQUE (duplicate rows would then be rejected at INSERT or UPDATE time).
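
A minimal sketch of that variant follows. The index name tablename_row_hash_uniq and the source table staging_table are hypothetical names chosen for illustration, and the same immutability caveat as for the plain index applies. The ON CONFLICT DO NOTHING clause (available from PostgreSQL 9.5) is an optional addition: it makes the hourly load skip rows that are already present instead of failing on the first duplicate.

```sql
-- Sketch: create the hash index as UNIQUE, once existing duplicates have been removed
-- (index creation fails if any duplicates remain).
CREATE UNIQUE INDEX tablename_row_hash_uniq
    ON tablename (md5((tablename.*)::text));

-- Hypothetical hourly load: staging_table is assumed to have the same columns as tablename.
-- Rows that would violate the unique index are silently skipped.
INSERT INTO tablename
SELECT * FROM staging_table
ON CONFLICT DO NOTHING;
```

Without the ON CONFLICT clause, a single duplicate row would make the whole INSERT statement fail with a unique-violation error, which may or may not be the desired behavior for an append-only log table.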

Comments

Didn't work for me, since I have timestamps with TZ, but I like the approach.
I'm trying to do something similar but with a timestamptz. However I know for a fact that the database is UTC and (obviously) so are the timestamps; I'm looking for a workaround to immutability. I want to nuke a few thousand duplicate rows out of millions of rows so I can recreate the primary key...
@Jeff: if you have a candidate primary key then you don't need the above method. Create an index on it, then eliminate the duplicates with a self-join that presumably will use that index, then drop the index, then set the unique constraint.
Thanks for the lead! I will try a few things, otherwise I'll create a new question here and let you know.
@OmriShneor: this answer appears to be obsolete and if you need to deduplicate only on some columns, that's not the same question anyway. Please submit a new question.