I have a table with 1000M rows and need an automated script that keeps only the last 7 days of data and deletes everything older. I want to do this in Python, deleting the rows chunk by chunk.

Is there a Python library that supports this chunk concept for MySQL?

If not, can anyone suggest a good way to apply chunked deletes with MySQL?

  • Use a PARTITIONED TABLE and split it by day of month (31 partitions), so you can drop each day easily. See: dev.mysql.com/doc/refman/5.7/en/partitioning-types.html or mariadb.com/de/resources/blog/… Commented Dec 30, 2021 at 21:42
  • Thanks Bernd. I'm trying to delete in chunks this way: "delete a from tab1 a left join tab2 b on a.subid=b.subid where b.subid is NULL and a.a_id between 2000 and 2500 ;" but nothing gets deleted. Is anything wrong with the query? If I remove the id BETWEEN condition, the query works. Commented Dec 30, 2021 at 22:05
  • Can you please show the CREATE statements of both tables and some sample data? Commented Dec 30, 2021 at 22:37

1 Answer

I'm unaware of a Python package that has an API for "chunking" deletes from a MySQL table. SQLAlchemy provides a fluent interface that can do this, but it's not much different from writing the SQL yourself. I suggest using PyMySQL.

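import datetime

import pymysql.cursors

# Placeholder connection details; replace with your own.
connection = pymysql.connect(
    host='host',
    user='user',
    password='password',
    database='database'
)
# Rows with a timestamp older than this cutoff are deleted.
seven_days_before_now = datetime.datetime.now() - datetime.timedelta(days=7)
chunksize = 1000
try:
    with connection.cursor() as cursor:
        sql = 'DELETE FROM `mytable` WHERE `timestamp` < %s ORDER BY `id` LIMIT %s;'
        num_deleted = None
        # Repeat until a pass deletes nothing.
        while num_deleted != 0:
            # execute() returns the number of affected rows.
            num_deleted = cursor.execute(sql, (seven_days_before_now, chunksize))
            # Commit each chunk so every transaction stays small.
            connection.commit()
finally:
    connection.close()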

The LIMIT caps the number of rows deleted per statement at the chunksize. The ORDER BY makes the DELETE deterministic, and it sorts by the primary key because the primary key is guaranteed to be indexed; so even though it sorts for each chunk, at least it sorts on an indexed column. Remove the ORDER BY if deterministic behavior is unnecessary; the statement will then run faster. You'll need to replace the connection details, table name, column name and chunksize. This solution also assumes that the table has a column named id that is the primary key and an auto-incrementing integer. You'll need to make some changes if your schema differs.
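If you want to reuse the loop, it could be wrapped in a small helper along these lines. This is only a sketch: the function name delete_in_chunks and the pause_seconds parameter are my own inventions, and it assumes a PyMySQL-style connection whose cursor works as a context manager and reports affected rows in rowcount. It stops as soon as a chunk comes back smaller than chunksize, which saves the final empty DELETE, and it can sleep between chunks to give other queries some room.

```python
import time


def delete_in_chunks(connection, sql, params, chunksize=1000, pause_seconds=0.0):
    """Run a `DELETE ... LIMIT %s` statement until it removes fewer rows than chunksize.

    `connection` is assumed to be a PyMySQL-style connection.
    `sql` must end with a `LIMIT %s` placeholder; `params` fills the other placeholders.
    Returns the total number of rows deleted.
    """
    total = 0
    with connection.cursor() as cursor:
        while True:
            cursor.execute(sql, (*params, chunksize))
            deleted = cursor.rowcount
            # Commit per chunk so locks are released early.
            connection.commit()
            total += deleted
            # A short chunk means no matching rows were left.
            if deleted < chunksize:
                break
            # Optional breathing room for other queries.
            time.sleep(pause_seconds)
    return total
```

With the names from the answer above it would be called roughly as delete_in_chunks(connection, sql, (seven_days_before_now,), chunksize=1000, pause_seconds=0.1).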

As Bernd Buffen commented: the correct way to get the behavior you desire is to partition the table. Please consider a migration to do so.
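To give a rough idea of what the cleanup could look like after such a migration, here is a sketch that builds the maintenance statement for a table partitioned by day of month. It assumes 31 partitions named p1 to p31 (one per day of month) and a table called mytable; both are hypothetical, so adapt them to whatever your partitioning scheme actually uses. Because partitions hold whole days, it keeps today plus the 7 preceding calendar days and truncates every other partition.

```python
import datetime


def partitions_to_truncate(today, keep_days=7, prefix="p"):
    """Return the day-of-month partitions that hold no data from the retention window."""
    # Days of month for today and the `keep_days` calendar days before it.
    keep = {(today - datetime.timedelta(days=n)).day for n in range(keep_days + 1)}
    return [f"{prefix}{day}" for day in range(1, 32) if day not in keep]


def truncate_statement(table, today, keep_days=7):
    """Build an ALTER TABLE ... TRUNCATE PARTITION statement (partition names are assumed)."""
    names = ", ".join(partitions_to_truncate(today, keep_days))
    return f"ALTER TABLE `{table}` TRUNCATE PARTITION {names};"


if __name__ == "__main__":
    print(truncate_statement("mytable", datetime.date.today()))
```

The resulting statement can be run through the same PyMySQL cursor as before. Truncating a partition discards its rows without scanning them one by one, which is why this tends to be far cheaper than chunked deletes.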

And, of course: stop using Python 2, it's been unsupported for almost two years as of the first version of this answer.
