
Could someone please point me in the right direction on how to solve the following problem? I am trying to come up with a solution using pandas.read_sql and asyncio. I want to migrate table records from one database to another.

I want to do the following:

table 1
.
.
.
table n

I have the function:

import pandas as pd

def extract(table):
    # build the query for the given table (e.g. select everything)
    sql = f"SELECT * FROM {table}"
    # read the table in chunks and concatenate them into one DataFrame
    return pd.concat(
        pd.read_sql(sql,
                    con=CONNECTION,
                    chunksize=10**5)
    )

I want to run these in parallel, not one by one:

extract(table1)
extract(table2)
.
.
extract(tablen)
  • Is asyncio a hard requirement? Have you considered threads or multiprocessing? Commented Jul 19, 2018 at 16:40
  • Yeah, but maybe I could get some ideas using threads or multiprocessing. I've heard there are a lot of problems that can occur with those methods, though. Commented Jul 19, 2018 at 17:18
  • Even if asyncio were a hard requirement, an asyncio-based solution would still use threads under the hood to run DataFrame.read_sql in parallel. With that in mind, it is better to use concurrent.futures, which provides excellent tools for parallelizing code. Commented Jul 20, 2018 at 15:43
  • So what is the solution? I have encountered a similar case. @user4815162342 Commented Jul 26, 2022 at 9:13
  • @Algo The accepted answer shows it. Commented Jul 27, 2022 at 5:30

1 Answer


asyncio is about organizing non-blocking code into callbacks and coroutines. Running CPU-intensive code in parallel is a use case for threads:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    frames = list(executor.map(extract, all_tables))

Whether this will actually run faster than sequential code depends on whether pd.read_sql releases the GIL.
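Putting the two pieces together, here is a minimal, self-contained sketch of the approach. It uses a shared in-memory SQLite database with made-up table names (`t1`, `t2`) as a stand-in for the real source database; the `DB_URI` and schema are assumptions for illustration only.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Hypothetical stand-in for the real source database: a shared
# in-memory SQLite database reachable from several connections.
DB_URI = "file:demo_db?mode=memory&cache=shared"

# Keep one connection open so the shared in-memory database stays alive,
# and populate two small example tables.
keeper = sqlite3.connect(DB_URI, uri=True)
keeper.execute("CREATE TABLE t1 (x INTEGER)")
keeper.execute("CREATE TABLE t2 (x INTEGER)")
keeper.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,)])
keeper.executemany("INSERT INTO t2 VALUES (?)", [(3,), (4,), (5,)])
keeper.commit()

def extract(table):
    # Each worker thread opens its own connection: sqlite3 connections
    # must not be shared between threads.
    con = sqlite3.connect(DB_URI, uri=True)
    try:
        # Read the table in chunks and concatenate into one DataFrame.
        return pd.concat(
            pd.read_sql(f"SELECT * FROM {table}", con=con, chunksize=10**5)
        )
    finally:
        con.close()

all_tables = ["t1", "t2"]
with ThreadPoolExecutor() as executor:
    # map() preserves the input order of all_tables in its results.
    frames = list(executor.map(extract, all_tables))

print([len(f) for f in frames])  # [2, 3]
```

With a real database you would replace `DB_URI` and the connection setup with your own `CONNECTION` factory; the `ThreadPoolExecutor` part stays the same.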


7 Comments

Is there any way to check for or release it in Python code? Something like: while fetching, release the GIL so other functions can run in parallel?
Another question: can you handle the results immediately with a ThreadPoolExecutor, or do you need to wait until all the extract(tab1...tabn) calls are finished?
@Maki You can't release the GIL in Python, but C extensions can do it when safe. (pandas' authors are aware of this.) If you need to handle results as they arrive, look into the executor's submit method. It returns a Future which you can manage in various ways, including registering a callback to be executed when the result is ready.
@Maki Also see as_completed, an iterator which accepts a bunch of futures and yields them as they finish.
As of pandas 2.1.4, the GIL is released, which means this answer indeed speeds up queries by allowing them to run concurrently; no need for multiprocessing in this case. :)
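Following up on the submit/as_completed comments above, here is a minimal sketch of that pattern. The extract function is a stand-in placeholder and the table names are hypothetical; the point is the Future-handling shape, not the query itself.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract(table):
    # Stand-in for the real extract(); returns a placeholder result.
    return f"frame for {table}"

tables = ["t1", "t2", "t3"]
results = {}
with ThreadPoolExecutor() as executor:
    # submit() returns one Future per table; as_completed() yields each
    # future as soon as it finishes, in completion order.
    futures = {executor.submit(extract, t): t for t in tables}
    for future in as_completed(futures):
        table = futures[future]
        results[table] = future.result()  # handle each result immediately

print(sorted(results))  # ['t1', 't2', 't3']
```

Unlike executor.map, which yields results in submission order, this lets you start processing whichever table finishes first.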
