Purging a huge MySQL table using Python

Question

I have a 1000M table and need an automated script that keeps only the last 7 days of data and deletes everything older. I want to do this in Python, deleting the rows chunk by chunk.

Is there a Python library that supports this kind of chunked delete for MySQL?

If not, can anyone suggest the best way to apply chunking with MySQL?

Use a PARTITIONED TABLE and split it by day of month (31 partitions), so you can drop each day easily. See: dev.mysql.com/doc/refman/5.7/en/partitioning-types.html or mariadb.com/de/resources/blog/…
– Bernd Buffen, Commented Dec 30, 2021 at 21:42
Thanks Bernd. I am trying to delete in chunks this way: "delete a from tab1 a left join tab2 b on a.subid=b.subid where b.subid is NULL and a.a_id between 2000 and 2500 ;" but nothing gets deleted. Is anything wrong with the query? If I remove the a_id BETWEEN condition, the query works.
– daina, Commented Dec 30, 2021 at 22:05
Can you please show the CREATE statements of both tables and some sample data?
– Bernd Buffen, Commented Dec 30, 2021 at 22:37

Michael Ruth · Accepted Answer · 2021-12-30 23:02:30Z

I'm unaware of a Python package that has an API for "chunking" deletes from a MySQL table. SQLAlchemy provides a fluent interface that can do this, but it's not much different from writing the SQL yourself. I suggest using PyMySQL.

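import datetime

import pymysql.cursors


# Replace the connection details with your own.
connection = pymysql.connect(
    host='host',
    user='user',
    password='password',
    database='database'
)
# Rows with a timestamp older than this cutoff are deleted.
seven_days_before_now = datetime.datetime.now() - datetime.timedelta(days=7)
chunksize = 1000
with connection.cursor() as cursor:
    sql = 'DELETE FROM `mytable` WHERE `timestamp` < %s ORDER BY `id` LIMIT %s;'
    num_deleted = None
    # execute() returns the number of affected rows; stop once a pass deletes nothing.
    while num_deleted != 0:
        num_deleted = cursor.execute(sql, (seven_days_before_now, chunksize))
        # Commit after every chunk so each transaction stays small.
        connection.commit()
connection.close()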

The LIMIT caps the number of rows deleted per statement at the chunksize. The ORDER BY makes the DELETE deterministic. It sorts by the primary key because the primary key is guaranteed to be indexed, so even though each chunk requires a sort, at least the sort is on an indexed column. If deterministic behavior is unnecessary, remove the ORDER BY; execution will be faster.

You'll need to replace the connection details, table name, column name and chunksize. This solution also assumes that the table has a column named id which is the primary key and an auto-incrementing integer. You'll need to make some changes if your schema differs.
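If it helps, here is the same loop wrapped in a reusable function. This is only a sketch: the name `purge_older_than` and the `pause` parameter are my own inventions, not part of PyMySQL. It drops the ORDER BY, stops as soon as a chunk comes back smaller than the chunksize, and can sleep between chunks to give other queries room.

```python
import datetime
import time


def purge_older_than(connection, table, ts_column, days=7, chunksize=1000, pause=0.0):
    """Delete rows older than `days` days in chunks; return the total deleted.

    Hypothetical helper: `connection` is expected to behave like a PyMySQL
    connection (cursor() usable as a context manager, execute() returning
    the affected row count).
    """
    cutoff = datetime.datetime.now() - datetime.timedelta(days=days)
    # Identifiers cannot be bound as parameters, so they are interpolated.
    # Only pass trusted, hard-coded table and column names here.
    sql = 'DELETE FROM `{}` WHERE `{}` < %s LIMIT %s'.format(table, ts_column)
    total = 0
    with connection.cursor() as cursor:
        while True:
            deleted = cursor.execute(sql, (cutoff, chunksize))
            connection.commit()  # one small transaction per chunk
            total += deleted
            if deleted < chunksize:
                break  # a short chunk means nothing older than the cutoff is left
            if pause:
                time.sleep(pause)  # optional breather between chunks
    return total
```

Usage would look something like `purge_older_than(connection, 'mytable', 'timestamp', chunksize=5000, pause=0.5)`, with `connection` created by `pymysql.connect(...)` as above.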

As Bernd Buffen commented: the correct way to get the behavior you desire is to partition the table. Please consider a migration to do so.
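To give a rough idea of what the purge could look like after such a migration, here is a sketch. It assumes a scheme that differs from the day-of-month layout in the comment: one partition per calendar day, named like `p20211230`. That naming convention and both function names are assumptions of mine. The code only builds the `ALTER TABLE ... DROP PARTITION` statement; you would fetch the partition names yourself (for example from `information_schema.PARTITIONS`) and execute the result with your cursor.

```python
import datetime


def expired_partitions(partition_names, today, keep_days=7):
    """Return the names of daily partitions older than `keep_days` days.

    Assumes the hypothetical naming convention pYYYYMMDD; names that do
    not match it (e.g. a catch-all partition) are ignored.
    """
    cutoff = today - datetime.timedelta(days=keep_days)
    expired = []
    for name in partition_names:
        try:
            day = datetime.datetime.strptime(name, 'p%Y%m%d').date()
        except ValueError:
            continue  # not a daily partition under this naming scheme
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


def drop_partitions_sql(table, names):
    """Build one ALTER TABLE statement that drops all the given partitions."""
    return 'ALTER TABLE `{}` DROP PARTITION {};'.format(table, ', '.join(names))
```

Dropping a partition is a metadata operation rather than a row-by-row delete, which is why it scales so much better than any chunked DELETE on a table this size.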

And, of course: stop using Python 2. It had been unsupported for almost two years as of the first version of this answer.
