So I've been looking into multiprocessing and threading in Python to run about a dozen SQL queries in parallel. Right now the queries run serially and take about 4 minutes total, and one query takes as long as the other 11 combined. So in theory I could cut my total run time at least in half by running all the queries in parallel.
I'm trying to do something along the lines of the following, but I haven't been able to find documentation confirming whether it's actually possible the way I'm picturing it:
So, say I have:
SSMS_query1 = "SELECT * FROM TABLE1"
SSMS_query2 = "SELECT * FROM TABLE2"
HANADB_query3 = "SELECT * FROM TABLE3"
So to connect to SSMS I use:
import pyodbc
server = "server_name"
cnxn = pyodbc.connect("DRIVER={SQL Server};SERVER=" + server + ";trusted_connection=Yes")
Then to connect to my HANAdb's I use:
from hdbcli import dbapi
conn = dbapi.connect(address="", port=, user="", password="")
Then essentially I want to do something where I can take advantage of pooling to save time, like:
import pandas as pd
with cnxn as ssms, conn as hana:
    df1 = pd.read_sql(SSMS_query1, ssms)
    df2 = pd.read_sql(SSMS_query2, ssms)
    df3 = pd.read_sql(HANADB_query3, hana)
I've tried using:
import multiprocessing
import threading
But I can't get the desired output, because eventually I want to write df1, df2, and df3 out to Excel. So how do I store the DataFrames produced by the parallel workers and use them as output later on?
Use ThreadPoolExecutor from concurrent.futures: docs.python.org/3/library/concurrent.futures.html. Your program is clearly I/O bound: it spends its time waiting on the SQL server, so threads (rather than processes) are the right fit, and each worker can simply return its DataFrame.