Postgresql slow delete from where exists

Question

I'm having trouble with slow delete queries. I have a schema ,say "target" containing tables that all have an equivalent table (identical columns & primary keys) in another one, say "delta". I now want to delete all rows that appear in the delta schema from the target schema. I have tried this using the DELETE FROM WHERE EXISTS approach, but that seems incredibly slow. Here's an example query:

DELETE FROM "target".name2phoneme
WHERE EXISTS(
  SELECT 1 FROM delta.name2phoneme d 
  WHERE name2phoneme.NAME_ID = d.NAME_ID 
  AND name2phoneme.PHONEME_ID = d.PHONEME_ID
);

This is the layout of both tables (whith the exception that the "delta" schema only has primary keys and no foreign keys)

CREATE TABLE name2phoneme
(
  name_id uuid NOT NULL,
  phoneme_id uuid NOT NULL,
  seq_num numeric(3,0),
  CONSTRAINT pk_name2phoneme PRIMARY KEY (name_id, phoneme_id),
  CONSTRAINT fk_name2phoneme_name_id_2_name FOREIGN KEY (name_id)
    REFERENCES name (name_id) MATCH SIMPLE
    ON UPDATE NO ACTION
    ON DELETE NO ACTION
    DEFERRABLE INITIALLY DEFERRED,
  CONSTRAINT fk_name2phoneme_phoneme_id_2_phoneme FOREIGN KEY (phoneme_id)
    REFERENCES phoneme (phoneme_id) MATCH SIMPLE
    ON UPDATE NO ACTION
    ON DELETE NO ACTION
    DEFERRABLE INITIALLY DEFERRED
)

The "target" table originally contains a little over 18M rows, while the delta table contains about 3.7M rows (that are to be deleted from the target).

Here's the output of EXPLAIN of the above query:

"Delete on name2phoneme  (cost=154858.03..1068580.46 rows=6449114 width=12)"
"  ->  Hash Join  (cost=154858.03..1068580.46 rows=6449114 width=12)"
"        Hash Cond: ((name2phoneme.name_id = d.name_id) AND (name2phoneme.phoneme_id = d.phoneme_id))"
"        ->  Seq Scan on name2phoneme  (cost=0.00..331148.16 rows=18062616 width=38)"
"        ->  Hash  (cost=69000.01..69000.01 rows=3763601 width=38)"
"              ->  Seq Scan on name2phoneme d  (cost=0.00..69000.01 rows=3763601 width=38)"

I tried to EXPLAIN ANALYZE the above query, but execution took over 2hrs so I killed it.

Any ideas on how I can optimize this operation?

There is a lot of overhead to deleting 3.7 million rows. I would suggest creating a new table with the rows you want, truncating the existing table, and re-inserting the rows you want back in. — Gordon Linoff
– Gordon Linoff, Commented Nov 20, 2017 at 22:24
You can try using WHERE (name_id, phoneme_id) IN (SELECT name_id, phoneme_id FROM other_table) and adding an index on (name_id, phoneme_id), but given the size of the tables I wouldn’t expect anything blazing fast... — jcaron
– jcaron, Commented Nov 20, 2017 at 22:28
@jcaron I tried explain DELETE FROM "target".name2phoneme WHERE (name_id, phoneme_id) in (SELECT d.name_id, d.phoneme_id FROM "delta".name2phoneme d);, which resulted in the same costs as the WHERE EXISTS approach. — Wouter Van Daele
– Wouter Van Daele, Commented Nov 20, 2017 at 22:42
@GordonLinoff Creating a new table with the rows I want would require an INSERT INTO ... WHERE NOT EXISTS construct, which would only yield similar results (at best), no? We create delta tables every week, so the number of rows in the delta schema can be a lot less (even 0). Wouldn't that be even less efficient if there are only a couple of rows to be deleted? — Wouter Van Daele
– Wouter Van Daele, Commented Nov 20, 2017 at 23:06

Gordon Linoff · Accepted Answer · 2017-11-21 03:34:49Z

1

Deleting 3.7 million rows is very time consuming, because of the overhead of looking up each row and then logging and deleting the rows. Just thinking about all the dirty pages, logging, and cache misses is mind-boggling -- not to mention updates to the indexes as well.

For that reason, something like this can be much faster:

create temporary table temp_n2p as 
    select n2p.*
    from "target".name2phoneme n2p
    where not exists (select 1
                      from delta.name2phoneme d 
                      where n2p.NAME_ID = d.NAME_ID and
                            n2p.PHONEME_ID = d.PHONEME_ID
                     );

truncate table "target".name2phoneme;

insert into "target".name2phoneme
    select *
    from temp_n2p;

You should also drop the indexes before the truncation and then recreate them afterwards.

edited Nov 21, 2017 at 3:34

answered Nov 21, 2017 at 0:08

Gordon Linoff

1.3m62 gold badges706 silver badges857 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Wouter Van Daele Over a year ago

Makes sense. I will try this approach, thank you. But shouldn't it be where NOT exists?

Gordon Linoff Over a year ago

@WouterVanDaele . . . Yes, that would be the correct logic.

Paul Maxwell · Accepted Answer · 2017-11-21 00:06:22Z

0

Have you tried either of these approaches:

DELETE 
FROM "target".name2phoneme t  
     USING delta.name2phoneme d 
WHERE t.NAME_ID = d.NAME_ID 
      AND t.PHONEME_ID = d.PHONEME_ID               
;

Or using WITH, but Postgres does materialize CTEs so I'm not confident this is wise at your scale of need.

WITH cte AS (
      SELECT t.name_id, t.phoneme_id
      FROM "target".name2phoneme t  
      INNER JOIN delta.name2phoneme d ON t.NAME_ID = d.NAME_ID 
                            AND t.PHONEME_ID = d.PHONEME_ID               
      )
DELETE FROM "target".name2phoneme t
     USING cte d
WHERE t.NAME_ID = d.NAME_ID 
      AND t.PHONEME_ID = d.PHONEME_ID               
;

answered Nov 21, 2017 at 0:06

Paul Maxwell

35.7k4 gold badges39 silver badges55 bronze badges

Collectives™ on Stack Overflow

Postgresql slow delete from where exists

2 Answers 2

2 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related