PostgreSQL query seems to be running in an infinite loop

Question

Following my previous question, I am now trying to remove duplicates from my database. I first run a sub-query to identify the near-identical records (the only difference is the index column "id"). My table has roughly 9 million records, and the code below had to be interrupted after about an hour and a half.

DELETE FROM public."OptionsData" 
WHERE id NOT IN
(
    SELECT id FROM (
        SELECT DISTINCT ON (asofdate, contract, strike, expiry, type, last, bid, ask, volume, iv, moneyness, underlying, underlyingprice) * FROM public."OptionsData"
    ) AS TempTable
);

Producing the results from the sub-query alone takes about 1 minute. Does the full query simply take a long time to run, or is there something off in my code?

Total number of records: 8'764'239. The sub-query indicates 8'681'440 unique records, so 82'799 duplicates. Hmm, then assuming 1 second to delete a record, that would require almost 23 hours (?)
– Hotone, Commented Aug 24, 2020 at 13:21

user330315 · Accepted Answer · 2020-08-24 13:20:18Z

3

NOT IN combined with a DISTINCT is usually quite slow.

Deleting duplicates using EXISTS is typically faster:

DELETE FROM public."OptionsData" d1
WHERE EXISTS (select *
              from public."OptionsData" d2
              where d1.id > d2.id
                and (d1.asofdate, d1.contract, d1.strike, d1.expiry, d1.type, d1.last, d1.bid, d1.ask, d1.volume, d1.iv, d1.moneyness, d1.underlying, d1.underlyingprice)
                    = (d2.asofdate, d2.contract, d2.strike, d2.expiry, d2.type, d2.last, d2.bid, d2.ask, d2.volume, d2.iv, d2.moneyness, d2.underlying, d2.underlyingprice)
              );

This will keep the row with the smallest id in each group of duplicates. If you want to keep the one with the highest id, use where d1.id < d2.id instead.
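
Before running this on the full table, it can help to check how many rows the predicate matches. Here is a minimal sketch, assuming a hypothetical demo table options_demo that carries only three of the compared columns (the table name and sample rows are made up for illustration, not taken from the question):

```sql
-- Hypothetical demo table: a cut-down stand-in for public."OptionsData".
CREATE TEMP TABLE options_demo (
    id       integer PRIMARY KEY,
    asofdate date,
    contract text,
    strike   numeric
);

INSERT INTO options_demo (id, asofdate, contract, strike) VALUES
    (1, '2020-08-21', 'ABC', 100),   -- first copy, should be kept
    (2, '2020-08-21', 'ABC', 100),   -- duplicate of id 1
    (3, '2020-08-21', 'ABC', 105),   -- different strike, should be kept
    (4, '2020-08-21', 'ABC', 100);   -- duplicate of id 1

BEGIN;

-- Preview: count the rows the DELETE below would remove (2 for this sample data).
SELECT count(*)
FROM options_demo d1
WHERE EXISTS (SELECT 1
              FROM options_demo d2
              WHERE d1.id > d2.id
                AND (d1.asofdate, d1.contract, d1.strike)
                    = (d2.asofdate, d2.contract, d2.strike));

-- Same predicate, now deleting every row that has an identical row with a smaller id.
DELETE FROM options_demo d1
WHERE EXISTS (SELECT 1
              FROM options_demo d2
              WHERE d1.id > d2.id
                AND (d1.asofdate, d1.contract, d1.strike)
                    = (d2.asofdate, d2.contract, d2.strike));

-- Inspect what is left (ids 1 and 3 for this sample data).
SELECT id FROM options_demo ORDER BY id;

COMMIT;  -- or ROLLBACK if the result is not what was expected
```

Both statements use the same EXISTS predicate, so on an unchanged table the count previews what the DELETE removes. One caveat: = between row values does not evaluate to true when a compared column is NULL on both sides, so such rows are not treated as duplicates here, whereas DISTINCT ON groups NULLs together. If the real columns can contain NULLs, IS NOT DISTINCT FROM in place of = is the NULL-safe comparison.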

1 Comment

Hotone Over a year ago

Thanks, EXISTS is definitely faster!!
