
So I've been looking into multiprocessing or parallel processing in Python to perform about a dozen SQL queries. Right now the queries run serially and take about 4 minutes total, where one query takes as long as the other 11 combined. So in theory I could cut my total run time at least in half by running the queries in parallel.

I'm trying to do something along the following lines, but I haven't been able to find documentation confirming whether it's actually possible the way I'm thinking about it:

So, say I have:

SSMS_query1 = "SELECT * FROM TABLE1"

SSMS_query2 = "SELECT * FROM TABLE2"

HANADB_query3 = "SELECT * FROM TABLE3"

So to connect to SSMS I use:

import pyodbc
server = "server_name"
cnxn = pyodbc.connect("DRIVER={SQL Server};SERVER=" + server + ";trusted_connection=Yes")

Then to connect to my HANAdb's I use:

from hdbcli import dbapi
conn = dbapi.connect(address="", port=, user="", password="")

Then essentially I want to do something where I can take advantage of pooling to save time, like:

import pandas as pd
with cnxn as ssms, conn as hana:
    df1 = pd.read_sql(SSMS_query1, ssms)
    df2 = pd.read_sql(SSMS_query2, ssms)
    df3 = pd.read_sql(HANADB_query3, hana)

I've tried using:

import multiprocessing
import threading

But I can't get the desired output, because eventually I want to write df1, df2, and df3 out to Excel. So how do I store the dataframes and use them as output later on while using parallelism?

Comments:

  • Take a look at ThreadPoolExecutor: docs.python.org/3/library/concurrent.futures.html. Your program is clearly I/O-bound while waiting for the SQL queries. – Commented Sep 6, 2022 at 19:17
  • Yeah, reading through the docs this looks good. I just need to figure out exactly how to apply it to my application. But this is definitely a lot closer to what I need than what I was originally looking at, thank you. – Commented Sep 7, 2022 at 11:38
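The `ThreadPoolExecutor` approach suggested in the comment can be sketched roughly as follows. This is a minimal illustration of the submission pattern only: `run_query` here is a hypothetical stand-in for `pd.read_sql(sql, conn)` so the sketch is self-contained; in the real program each task would run one query against its own database connection and return a DataFrame.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for pd.read_sql(sql, conn); a real task would take a
# (connection, sql) pair and return a DataFrame.
def run_query(sql):
    return f"result of {sql}"

queries = [
    "SELECT * FROM TABLE1",
    "SELECT * FROM TABLE2",
    "SELECT * FROM TABLE3",
]

# One worker thread per query. executor.map preserves input order,
# so the results unpack in the same order as the queries.
with ThreadPoolExecutor(max_workers=len(queries)) as executor:
    df1, df2, df3 = executor.map(run_query, queries)
```

Because `executor.map` yields results in submission order, the first result always corresponds to the first query regardless of which one finishes first.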

1 Answer


Not knowing precisely how large the resulting dataframes are, I would think that multithreading is likely more efficient than multiprocessing, since multiprocessing generally has much more overhead in moving results from a child process back to the main process. And since the queries take 4 minutes, I have to assume the amount of data is fairly large. Besides, much of the time is spent in network activity, for which multithreading is well suited.

Here I am assuming the worst case where a database connection cannot be shared among threads. If that is not the case, then create only one connection and use it for all submitted tasks:

from multiprocessing.pool import ThreadPool
import time
import pandas as pd
import pyodbc

def run_sql(conn, sql):
    # Each task runs one query on its own connection and returns a dataframe
    return pd.read_sql(sql, conn)

def main():
    SSMS_query1 = "SELECT * FROM TABLE1"
    SSMS_query2 = "SELECT * FROM TABLE2"
    HANADB_query3 = "SELECT * FROM TABLE3"
    
    queries = (SSMS_query1, SSMS_query2, HANADB_query3)
    n_queries = len(queries)

    server = "server_name"
    connections = [
        pyodbc.connect("DRIVER={SQL Server};SERVER=" + server + ";trusted_connection=Yes")
            for _ in range(n_queries)
    ]

    t0 = time.time()
    # One thread per query:
    with ThreadPool(n_queries) as pool:
        results = pool.starmap(run_sql, zip(connections, queries))
        df1, df2, df3 = results  # unpack the three dataframes
        t1 = time.time()
        print(df1)
        print(df2)
        print(df3)
        print(t1 - t0)

if __name__ == '__main__':
    main()
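Once the dataframes are back in the main process, writing them to Excel is straightforward. A minimal sketch, assuming `pandas` with the `openpyxl` engine is available; the frames and sheet names here are hypothetical stand-ins for `df1`, `df2`, and `df3`, and a real run would pass a filename like `"output.xlsx"` instead of an in-memory buffer:

```python
import io
import pandas as pd

# Hypothetical small frames standing in for the query results.
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": [3, 4]})
df3 = pd.DataFrame({"c": [5, 6]})

buffer = io.BytesIO()  # a real run would use a path such as "output.xlsx"
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    # One sheet per query result.
    for name, df in {"Query1": df1, "Query2": df2, "Query3": df3}.items():
        df.to_excel(writer, sheet_name=name, index=False)

# Read one sheet back to confirm the round trip.
buffer.seek(0)
roundtrip = pd.read_excel(buffer, sheet_name="Query2")
```

The `with` block ensures the workbook is finalized before anything tries to read it.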

10 Comments

Thank you, yeah, I tried to keep it simple because I don't want someone flat-out telling me how to solve my problem. I have 6 queries that share the same HANA connection string, 5 that share a connection string to one SSMS DB, and then 1 that is completely independent. I'm going to play around with your solution and see what happens, I appreciate it.
My only question left would be: if I want to store the results of query1, query2, and query3 in the corresponding df1, df2, and df3, how would I do that? Can I just store the unnamed dataframes in a list and reference them by index or something? I'm not sure what the best option is there.
The posted code is already creating dataframes from the query results with return pd.read_sql(sql, conn) so results in the main process is a list of these. You can then unpack the list: df1, df2, df3 = results.
Makes sense, I did not see that. Thank you, this definitely helps!
And, of course, instead of unpacking the list you can simply index results, e.g. results[0].
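Beyond unpacking or indexing, the results can also be keyed by name so each dataframe stays addressable without relying on position. A minimal sketch; the names and stand-in values below are hypothetical, with the strings taking the place of the dataframes in `results`:

```python
# Stand-ins for the dataframes returned by pool.starmap, in query order.
results = ["r1", "r2", "r3"]
names = ["df1", "df2", "df3"]

# zip pairs each name with the result in the same position.
frames = dict(zip(names, results))
# frames["df2"] now refers to the second query's result.
```

This keeps the association between query and result explicit even if the number of queries changes later.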
