1

This query needs to delete over 17 million rows, from a table containing 20 million.

DELETE
FROM statements
WHERE agreement_id IN
    (SELECT id
     FROM agreements
     WHERE created < DATE_SUB(CURDATE(), INTERVAL 6 MONTH));


DELETE
FROM agreements
WHERE created < DATE_SUB(CURDATE(), INTERVAL 6 MONTH)

It takes hours to run. Am I missing something that could speed things up a bit?

The subselect by itself takes a few seconds, I don't understand why the delete takes so long.

2
  • Can we see the structure of the statements table? Maybe you should put an index on agreement_id. Commented Jan 10, 2019 at 19:49
  • Every query-optimization question should include the output of SHOW CREATE TABLE <tablename> for each table referenced in the query. Help us help you — don't make us guess at which data types and indexes you currently have. Commented Jan 10, 2019 at 21:36

3 Answers

1

When you need to delete this many rows, I suggest that you:

  1. Create a new temporary table containing the data that will stay.
  2. Truncate your main table.
  3. Move the data from the temporary table back into your main table.

or

  1. Create a new temporary table containing the data that will stay.
  2. Drop your main table.
  3. Rename the temporary table to the main table's name (don't forget to recreate the constraints).

Also, regarding your query:

Avoid IN (SELECT ...) on large tables; MySQL often handles it inefficiently. Use EXISTS instead, which usually performs better.

Basic script:

-- Copy only the rows to keep: statements NOT tied to an agreement older than 6 months.
CREATE TABLE tmp_statements AS
  SELECT s.*
  FROM statements s
  WHERE NOT EXISTS
  (
     SELECT 1
     FROM agreements a
     WHERE a.id = s.agreement_id
       AND a.created < DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
  );

DROP TABLE statements;

RENAME TABLE tmp_statements TO statements;

-- Don't forget to recreate your indexes and constraints.

1 Comment

As I wrote in a comment on the other answer that links to the same approach, I was sceptical upon reading this. I needed to build a minimal dataset for local dev and, after trying alternative methods, went with this approach. It was by far the quickest (and a tad dirty) way to do it. Thanks!
1

Try to rewrite the first statement to use EXISTS.

DELETE FROM statements
WHERE EXISTS (SELECT *
              FROM agreements
              WHERE agreements.id = statements.agreement_id
                AND agreements.created < DATE_SUB(CURDATE(), INTERVAL 6 MONTH));

And put an index on agreements (id, created) (if not already there).

CREATE INDEX agreements_id_created
             ON agreements
                (id,
                 created);

For the second one create an index on agreements (created) (if not already there).

CREATE INDEX agreements_created
             ON agreements
                (created);
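
To check whether these indexes are actually picked up, one option is to run EXPLAIN on both statements (MySQL 5.6 and later accept EXPLAIN for DELETE). This is only a sketch of that check, assuming the table and column names from the question; the plan you see will depend on your MySQL version and data.

```sql
-- EXPLAIN shows the execution plan without deleting anything.
EXPLAIN
DELETE FROM statements
WHERE EXISTS (SELECT *
              FROM agreements
              WHERE agreements.id = statements.agreement_id
                AND agreements.created < DATE_SUB(CURDATE(), INTERVAL 6 MONTH));

EXPLAIN
DELETE FROM agreements
WHERE created < DATE_SUB(CURDATE(), INTERVAL 6 MONTH);
```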

1

Use a "multi-table delete" instead of the usually inefficient IN ( SELECT ... ).
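
A minimal sketch of what that could look like for the first statement, assuming the column names from the question (statements.agreement_id referring to agreements.id):

```sql
-- Multi-table DELETE: removes rows from statements only (the alias named after DELETE),
-- using a join against agreements to pick the rows.
DELETE s
FROM statements AS s
JOIN agreements AS a
  ON a.id = s.agreement_id
WHERE a.created < DATE_SUB(CURDATE(), INTERVAL 6 MONTH);
```

An index on statements (agreement_id) would probably help the join, but the question does not show which indexes exist.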

Several techniques for large deletes are discussed here.

To delete 85% of the table, it is really best to build a new table with the 15% you are keeping, then swap the table into place. (More on that in the link above.)
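
A rough sketch of the swap, assuming the table and column names from the question; statements_new and statements_old are placeholder names chosen here. CREATE TABLE ... LIKE copies the column definitions and indexes but not foreign keys, so those would have to be re-added, and rows written to statements while the copy runs would not be carried over.

```sql
-- Empty copy of the table, including its indexes (foreign keys are not copied).
CREATE TABLE statements_new LIKE statements;

-- Copy only the rows to keep: those not tied to an agreement older than 6 months.
INSERT INTO statements_new
SELECT s.*
FROM statements AS s
WHERE NOT EXISTS (SELECT 1
                  FROM agreements AS a
                  WHERE a.id = s.agreement_id
                    AND a.created < DATE_SUB(CURDATE(), INTERVAL 6 MONTH));

-- Swap both names in a single RENAME TABLE statement.
RENAME TABLE statements TO statements_old,
             statements_new TO statements;

-- Once the new table has been checked, drop the old one.
DROP TABLE statements_old;
```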

2 Comments

Upon reading this I was sceptical. I needed to build a minimal dataset for local dev and, after trying alternative methods, went with this approach. It was by far the quickest (and a tad dirty) way to do it.
@stefgosselin - In real life, kludges are sometimes OK. (I assume you were talking about the "copy rows to keep" method?)
