
Could someone please point me in the right direction on how to solve the following problem? I am trying to come up with a solution using pandas.read_sql and asyncio. I want to migrate table records from one database to another.

I want to do the following:

table 1
.
.
.
table n

I have the function:

import pandas as pd

def extract(table):
    # build the query for the given table (e.g. select everything)
    sql = f"SELECT * FROM {table}"
    # read the table in chunks and concatenate them into one DataFrame
    return pd.concat(
        pd.read_sql(sql,
                    con=CONNECTION,
                    chunksize=10**5)
    )

I want to run these in parallel, not one by one:

extract(table1)
extract(table2)
.
.
extract(tablen)
  • Is asyncio a hard requirement? Have you considered threads or multiprocessing? Commented Jul 19, 2018 at 16:40
  • Yeah, but maybe I could get some ideas using threads or multiprocessing. I've heard there are a lot of problems that can occur with those methods, though. Commented Jul 19, 2018 at 17:18
  • Even if asyncio were a hard requirement, an asyncio-based solution would still use threads under the hood to run DataFrame.read_sql in parallel. With that in mind, it is better to use concurrent.futures, which provides excellent tools for parallelizing code. Commented Jul 20, 2018 at 15:43
  • So what is the solution? I have encountered a similar case. @user4815162342 Commented Jul 26, 2022 at 9:13
  • @Algo The accepted answer shows it. Commented Jul 27, 2022 at 5:30

1 Answer


asyncio is about organizing non-blocking code into callbacks and coroutines. Running CPU-intensive code in parallel is a use case for threads:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    frames = list(executor.map(extract, all_tables))

Whether this will actually run faster than sequential code depends on whether pd.read_sql releases the GIL.
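Putting the two pieces together, here is a minimal, self-contained sketch of the approach. It uses a shared in-memory SQLite database with made-up table names (`t1`, `t2`) as a stand-in for the real source database; the `DB_URI` and schema are assumptions for illustration only.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Hypothetical stand-in for the real source database: a shared
# in-memory SQLite database reachable from several connections.
DB_URI = "file:demo_db?mode=memory&cache=shared"

# Keep one connection open so the shared in-memory database stays alive,
# and populate two small example tables.
keeper = sqlite3.connect(DB_URI, uri=True)
keeper.execute("CREATE TABLE t1 (x INTEGER)")
keeper.execute("CREATE TABLE t2 (x INTEGER)")
keeper.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,)])
keeper.executemany("INSERT INTO t2 VALUES (?)", [(3,), (4,), (5,)])
keeper.commit()

def extract(table):
    # Each worker thread opens its own connection: sqlite3 connections
    # must not be shared between threads.
    con = sqlite3.connect(DB_URI, uri=True)
    try:
        # Read the table in chunks and concatenate into one DataFrame.
        return pd.concat(
            pd.read_sql(f"SELECT * FROM {table}", con=con, chunksize=10**5)
        )
    finally:
        con.close()

all_tables = ["t1", "t2"]
with ThreadPoolExecutor() as executor:
    # map() preserves the input order of all_tables in its results.
    frames = list(executor.map(extract, all_tables))

print([len(f) for f in frames])  # [2, 3]
```

With a real database you would replace `DB_URI` and the connection setup with your own `CONNECTION` factory; the `ThreadPoolExecutor` part stays the same.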


7 Comments

Is there any way to check for or release it in Python code? Something like: while fetching, release the GIL so other functions can run in parallel?
Another question: can you handle the results immediately with a ThreadPoolExecutor, or do you need to wait until all the extract(tab1...tabn) calls are finished?
@Maki You can't release the GIL in Python, but C extensions can do it when safe. (pandas' authors are aware of this.) If you need to handle results as they arrive, look into the executor's submit method. It returns a Future which you can manage in various ways, including registering a callback to be executed when the result is ready.
@Maki Also see as_completed, an iterator which accepts a bunch of futures and yields them as they finish.
As of pandas 2.1.4, the GIL is released, which means this answer indeed speeds up queries by allowing them to run concurrently; no need for multiprocessing in this case. :)
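Following up on the submit/as_completed comments above, here is a minimal sketch of that pattern. The extract function is a stand-in placeholder and the table names are hypothetical; the point is the Future-handling shape, not the query itself.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract(table):
    # Stand-in for the real extract(); returns a placeholder result.
    return f"frame for {table}"

tables = ["t1", "t2", "t3"]
results = {}
with ThreadPoolExecutor() as executor:
    # submit() returns one Future per table; as_completed() yields each
    # future as soon as it finishes, in completion order.
    futures = {executor.submit(extract, t): t for t in tables}
    for future in as_completed(futures):
        table = futures[future]
        results[table] = future.result()  # handle each result immediately

print(sorted(results))  # ['t1', 't2', 't3']
```

Unlike executor.map, which yields results in submission order, this lets you start processing whichever table finishes first.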
