Step 1
We are trying to speed up a delete query of the form:
DELETE FROM table_name
WHERE col_name in ('a','b',....'zzzz');
The operation deletes between 0.5% and 50% of the table's rows. col_name is an indexed (non-unique) column.
This ran extremely slowly because each deleted row forced an update of the index.
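One variant that might be worth measuring (a sketch; the index name col_name_idx is hypothetical and not from the question): drop the index, delete, then rebuild, trading per-row index maintenance for a single bulk rebuild.

```sql
-- Sketch, assuming the table can tolerate the index being absent briefly.
-- The index name (col_name_idx) is hypothetical.
DROP INDEX col_name_idx;

DELETE FROM table_name
WHERE col_name IN ('a', 'b', 'zzzz');

CREATE INDEX col_name_idx ON table_name (col_name);
```

Whether the rebuild is cheaper than incremental maintenance depends on the fraction of rows deleted and on how long the table can go without the index.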
Step 2
We added a non-indexed boolean tombstone column called deleted with DEFAULT FALSE. Our query became:
UPDATE table_name
SET deleted = TRUE
WHERE col_name in ('a','b',....'zzzz');
This runs noticeably quicker (60-200% faster), but the planner seems to ignore the col_name index for large IN lists. However, since the update touches only an unindexed column, it remains fast.
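For concreteness, the tombstone setup described above might look like this (a sketch; the exact DDL was not given in the question):

```sql
-- Tombstone column: rows are marked as deleted rather than removed.
ALTER TABLE table_name
    ADD COLUMN deleted boolean NOT NULL DEFAULT FALSE;

-- Every reader then has to exclude tombstoned rows, e.g.:
-- SELECT ... FROM table_name WHERE NOT deleted;
```

The trade-off is that the dead rows still occupy space and every query must filter on deleted until a periodic bulk purge runs.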
Step 3
We rewrote the condition as:
UPDATE table_name
SET deleted = TRUE
WHERE col_name = 'a'
OR col_name = 'b'
OR ...
OR col_name = 'zzzz';
Even though this uses the index, it runs at about the same speed as the DELETE from Step 1.
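Another formulation that might be worth benchmarking (a sketch, assuming PostgreSQL's UPDATE ... FROM syntax, which the question does not confirm): present the list as a derived table so the planner can hash-join against it instead of evaluating thousands of predicates per row.

```sql
-- Sketch: the value list becomes a relation the planner can hash-join.
UPDATE table_name AS t
SET deleted = TRUE
FROM (VALUES ('a'), ('b'), ('zzzz')) AS v(col_name)
WHERE t.col_name = v.col_name;
```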
Is there a fast way to delete (or mark as deleted) a number of rows based on membership within a very large IN clause?
The database needs no concurrency handling as it is accessed by a dedicated single-threaded application.
Note: performing the deletes/updates individually was an order of magnitude slower. The IN clause generally has between 20,000 and 5 million elements.
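For lists this large, one pattern that may be worth measuring (a sketch, assuming PostgreSQL; the table name keys_to_delete and the file path are hypothetical) is to bulk-load the keys into a temporary table once and then delete with a join, so the list is handled as data rather than as a multi-million-term expression the planner must parse:

```sql
-- Hypothetical names throughout; adjust the key column's type to match.
CREATE TEMP TABLE keys_to_delete (col_name text PRIMARY KEY);

-- Bulk-load the key list (COPY, or batched multi-row INSERTs).
COPY keys_to_delete FROM '/path/to/keys.txt';

ANALYZE keys_to_delete;  -- give the planner row estimates

DELETE FROM table_name AS t
USING keys_to_delete AS k
WHERE t.col_name = k.col_name;
```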
Comments:
- Use explain (analyze) to find out if finding the rows is the slow part or deleting them.
- Please provide the CREATE TABLE statement (showing data types and constraints), cardinalities, and relevant resources. See the tag info of [postgresql-performance] for instructions.
- Use explain (analyze, buffers) to figure out what the actual problem is.