
Given the simple code below:

cur.execute("SELECT * FROM myTable")

for row in cur.fetchall():
    process(row)

If the database is large, this will take a long time to process all the rows. How can I use multi-threading here to speed it up? Specifically, each row should only be processed once, so when multi-threading, each thread should only process one ID.

Do I basically need to get the total number of rows, decide how many threads I want, and then tell each thread the min/max IDs it should filter on? Something like the sketch below is what I have in mind.
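Rough, untested sketch of the idea, where connect() is a placeholder for however each thread would open its own connection (a single cursor can't safely be shared across threads), and id is assumed to be an integer primary key:

import threading

NUM_THREADS = 4

def worker(lo, hi):
    # Each thread opens its own connection and scans one ID range
    conn = connect()
    cur = conn.cursor()
    cur.execute("SELECT * FROM myTable WHERE id >= %s AND id < %s", (lo, hi))
    for row in cur.fetchall():
        process(row)
    conn.close()

cur.execute("SELECT MIN(id), MAX(id) FROM myTable")
min_id, max_id = cur.fetchone()
step = (max_id - min_id) // NUM_THREADS + 1

threads = [threading.Thread(target=worker, args=(lo, lo + step))
           for lo in range(min_id, max_id + 1, step)]
for t in threads:
    t.start()
for t in threads:
    t.join()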

1 Answer


Because the Python GIL limits parallelism in threads, you are likely better off using subprocesses instead of threads. You can use a pool that limits the number of concurrent processes so that you don't over-commit your server; the multiprocessing module should do the trick. The cursor is already an iterator, and with a small chunksize when calling imap, it will only pull rows from the database as the workers need them. I don't know the size of your rows or the processing time involved, so I picked a modest chunksize and the default pool size, which is the number of cores on your machine.

import multiprocessing as mp

pool = mp.Pool()  # default size: one worker process per CPU core

cur.execute("SELECT * FROM myTable")

# imap consumes the cursor lazily, sending each worker chunksize rows at a
# time; process() must be a module-level function so it can be pickled
for result in pool.imap(process, cur, chunksize=5):
    print(result)

pool.close()
pool.join()

You can experiment with a thread pool instead by substituting

from multiprocessing.pool import ThreadPool
pool = ThreadPool()

ThreadPool has the same interface as Pool, so the rest of the code stays unchanged; it can pay off when process is I/O-bound rather than CPU-bound, since the GIL is released during I/O.