
Given the simple code below:

cur.execute("SELECT * FROM myTable")

for row in cur.fetchall():
    process(row)

If the database is large, this will take a long time to process all the rows. How can I use multi-threading here to speed it up? Specifically, each row should only be processed once, so when multi-threading, each thread should only process one ID.

Do I basically need to get the total number of rows, decide how many threads I want, and then tell each thread the min/max IDs it should filter on? Something like the sketch below is what I have in mind.
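Rough, untested sketch of the idea, where connect() is a placeholder for however each thread would open its own connection (a single cursor can't safely be shared across threads), and id is assumed to be an integer primary key:

import threading

NUM_THREADS = 4

def worker(lo, hi):
    # Each thread opens its own connection and scans one ID range
    conn = connect()
    cur = conn.cursor()
    cur.execute("SELECT * FROM myTable WHERE id >= %s AND id < %s", (lo, hi))
    for row in cur.fetchall():
        process(row)
    conn.close()

cur.execute("SELECT MIN(id), MAX(id) FROM myTable")
min_id, max_id = cur.fetchone()
step = (max_id - min_id) // NUM_THREADS + 1

threads = [threading.Thread(target=worker, args=(lo, lo + step))
           for lo in range(min_id, max_id + 1, step)]
for t in threads:
    t.start()
for t in threads:
    t.join()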

1 Answer


Because the Python GIL limits parallelism in threads, you are likely better off using subprocesses instead of threads. You can use a pool that limits the number of concurrent processes so that you don't over-commit your server; the multiprocessing module should do the trick. The cursor is already an iterator, and with a small chunksize when calling imap, it will only pull rows from the database as the workers need them. I don't know the size of your rows or the processing time involved, so I picked a modest chunksize and the default pool size, which is the number of cores on your machine.

import multiprocessing as mp

pool = mp.Pool()  # default size: one worker process per CPU core

cur.execute("SELECT * FROM myTable")

# imap consumes the cursor lazily, sending each worker chunksize rows at a
# time; process() must be a module-level function so it can be pickled
for result in pool.imap(process, cur, chunksize=5):
    print(result)

pool.close()
pool.join()

You can experiment with a thread pool instead by substituting

from multiprocessing.pool import ThreadPool
pool = ThreadPool()

ThreadPool has the same interface as Pool, so the rest of the code stays unchanged; it can pay off when process is I/O-bound rather than CPU-bound, since the GIL is released during I/O.