
So I've been looking into multiprocessing or parallel processing in Python to perform about a dozen SQL queries. Right now the queries run serially and take about 4 minutes total, where one query takes as long as the other 11 combined. So in theory I could cut my total run time at least in half by running the queries in parallel.

I'm trying to do something along the following lines, but I haven't been able to find documentation confirming whether it's actually possible the way I'm thinking about it:

So, say I have:

SSMS_query1 = "SELECT * FROM TABLE1"

SSMS_query2 = "SELECT * FROM TABLE2"

HANADB_query3 = "SELECT * FROM TABLE3"

So to connect to SSMS I use:

import pyodbc
server = "server_name"
cnxn = pyodbc.connect("DRIVER={SQL Server};SERVER=" + server + ";trusted_connection=Yes")

Then to connect to my HANAdb's I use:

from hdbcli import dbapi
conn = dbapi.connect(address="", port=, user="", password="")

Then essentially I want to do something where I can take advantage of pooling to save time, like:

import pandas as pd
with cnxn as ssms, conn as hana:
    df1 = pd.read_sql(SSMS_query1, ssms)
    df2 = pd.read_sql(SSMS_query2, ssms)
    df3 = pd.read_sql(HANADB_query3, hana)

I've tried using:

import multiprocessing
import threading

But I can't get the desired output, because eventually I want to write df1, df2, and df3 out to Excel. So how do I store the dataframes and use them as output later on while using parallelism?

Comments:

  • Take a look at ThreadPoolExecutor: docs.python.org/3/library/concurrent.futures.html. Your program is clearly I/O-bound while waiting for the SQL queries. – Commented Sep 6, 2022 at 19:17
  • Yeah, reading through the docs this looks good. I just need to figure out exactly how to apply it to my application. But this is definitely a lot closer to what I need than what I was originally looking at, thank you. – Commented Sep 7, 2022 at 11:38
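The `ThreadPoolExecutor` approach suggested in the comment can be sketched roughly as follows. This is a minimal illustration of the submission pattern only: `run_query` here is a hypothetical stand-in for `pd.read_sql(sql, conn)` so the sketch is self-contained; in the real program each task would run one query against its own database connection and return a DataFrame.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for pd.read_sql(sql, conn); a real task would take a
# (connection, sql) pair and return a DataFrame.
def run_query(sql):
    return f"result of {sql}"

queries = [
    "SELECT * FROM TABLE1",
    "SELECT * FROM TABLE2",
    "SELECT * FROM TABLE3",
]

# One worker thread per query. executor.map preserves input order,
# so the results unpack in the same order as the queries.
with ThreadPoolExecutor(max_workers=len(queries)) as executor:
    df1, df2, df3 = executor.map(run_query, queries)
```

Because `executor.map` yields results in submission order, the first result always corresponds to the first query regardless of which one finishes first.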

1 Answer


Not knowing precisely how large the resulting dataframes are, I would think that multithreading is likely more efficient than multiprocessing, since multiprocessing generally has much more overhead in moving results from a child process back to the main process. And since the queries take 4 minutes, I have to assume the amount of data is fairly large. Besides, much of the time is spent in network activity, for which multithreading is well suited.

Here I am assuming the worst case where a database connection cannot be shared among threads. If that is not the case, then create only one connection and use it for all submitted tasks:

from multiprocessing.pool import ThreadPool
import time
import pandas as pd
import pyodbc

def run_sql(conn, sql):
    # Each task runs one query on its own connection and returns a dataframe
    return pd.read_sql(sql, conn)

def main():
    SSMS_query1 = "SELECT * FROM TABLE1"
    SSMS_query2 = "SELECT * FROM TABLE2"
    HANADB_query3 = "SELECT * FROM TABLE3"
    
    queries = (SSMS_query1, SSMS_query2, HANADB_query3)
    n_queries = len(queries)

    server = "server_name"
    connections = [
        pyodbc.connect("DRIVER={SQL Server};SERVER=" + server + ";trusted_connection=Yes")
            for _ in range(n_queries)
    ]

    t0 = time.time()
    # One thread per query:
    with ThreadPool(n_queries) as pool:
        results = pool.starmap(run_sql, zip(connections, queries))
        df1, df2, df3 = results  # unpack the three dataframes
        t1 = time.time()
        print(df1)
        print(df2)
        print(df3)
        print(t1 - t0)

if __name__ == '__main__':
    main()
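Once the dataframes are back in the main process, writing them to Excel is straightforward. A minimal sketch, assuming `pandas` with the `openpyxl` engine is available; the frames and sheet names here are hypothetical stand-ins for `df1`, `df2`, and `df3`, and a real run would pass a filename like `"output.xlsx"` instead of an in-memory buffer:

```python
import io
import pandas as pd

# Hypothetical small frames standing in for the query results.
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": [3, 4]})
df3 = pd.DataFrame({"c": [5, 6]})

buffer = io.BytesIO()  # a real run would use a path such as "output.xlsx"
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    # One sheet per query result.
    for name, df in {"Query1": df1, "Query2": df2, "Query3": df3}.items():
        df.to_excel(writer, sheet_name=name, index=False)

# Read one sheet back to confirm the round trip.
buffer.seek(0)
roundtrip = pd.read_excel(buffer, sheet_name="Query2")
```

The `with` block ensures the workbook is finalized before anything tries to read it.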

10 Comments

Thank you, yeah, I tried to keep it simple because I don't want someone flat-out telling me how to solve my problem. I have 6 queries that share the same HANA connection string, 5 that share a connection string to one SSMS DB, and then 1 that is completely independent. I'm going to play around with your solution and see what happens, I appreciate it.
My only question left would be: if I want to store the results of query1, query2, and query3 in the corresponding df1, df2, and df3, how would I do that? Can I just store the unnamed dataframes in a list and reference them by index or something? I'm not sure what the best option is there.
The posted code is already creating dataframes from the query results with return pd.read_sql(sql, conn) so results in the main process is a list of these. You can then unpack the list: df1, df2, df3 = results.
Makes sense, I did not see that. Thank you, this definitely helps!
And, of course, instead of unpacking the list you can simply index results, e.g. results[0].
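Beyond unpacking or indexing, the results can also be keyed by name so each dataframe stays addressable without relying on position. A minimal sketch; the names and stand-in values below are hypothetical, with the strings taking the place of the dataframes in `results`:

```python
# Stand-ins for the dataframes returned by pool.starmap, in query order.
results = ["r1", "r2", "r3"]
names = ["df1", "df2", "df3"]

# zip pairs each name with the result in the same position.
frames = dict(zip(names, results))
# frames["df2"] now refers to the second query's result.
```

This keeps the association between query and result explicit even if the number of queries changes later.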
