I am currently loading query results into a DataFrame with pandas.io.sql.read_sql(). I want to parallelize the calls, similar to what the speaker advocates in this talk: Embarrassingly parallel database calls with Python (PyData Paris 2015).

Something like (very general):

pools = [ThreadedConnectionPool(1,20,dsn=d) for d in dsns]
connections = [pool.getconn() for pool in pools]
parallel_connection = ParallelConnection(connections)
pandas_cursor = parallel_connection.cursor()
pandas_cursor.execute(my_query)

Is something like that possible?

  • What is your SQL database type and driver, and do they support multi-threaded calls? Commented Aug 29, 2015 at 19:19
  • Using MS SQL Server; it does support multi-threaded calls. Commented Aug 30, 2015 at 11:34
  • Not sure about pyodbc, but pymssql seems to have been thread-safe since 2013: pymssql.org/en/latest/changelog.html?highlight=threading Commented Aug 30, 2015 at 21:39
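
One way to check a driver's claim yourself: DB-API 2.0 (PEP 249) modules expose a module-level threadsafety integer. The sketch below reads it from the standard-library sqlite3 module only so that it runs anywhere; describe_threadsafety is a hypothetical helper name, and you would pass your own driver module (for example pyodbc or pymssql) instead.

```python
import sqlite3

# Meaning of the PEP 249 `threadsafety` levels.
LEVELS = {
    0: "threads may not share the module",
    1: "threads may share the module, but not connections",
    2: "threads may share the module and connections",
    3: "threads may share the module, connections, and cursors",
}


def describe_threadsafety(driver):
    """Return (level, meaning) for a DB-API module; hypothetical helper."""
    level = getattr(driver, "threadsafety", None)
    return level, LEVELS.get(level, "driver does not declare a level")


# sqlite3 stands in for the real driver here.
level, meaning = describe_threadsafety(sqlite3)
print(level, meaning)
```

A level of 1 is enough for the approach discussed here, as long as each thread works on its own connection.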

1 Answer

Yes, this should work, with the caveat that you'll need to change parallel_connection.py from the talk you cite. That code has a fetchall function which calls fetchall on each of the cursors in parallel, then combines the results. This is the core of what you'll change:

Old Code:

def fetchall(self):
    results = [None] * len(self.cursors)
    def do_work(index, cursor):
        results[index] = cursor.fetchall()
    self._do_parallel(do_work)
    return list(chain(*[rs for rs in results]))

New Code:

def fetchall(self):
    # Requires `import pandas as pd` at module level.
    results = [None] * len(self.sql_connections)
    def do_work(index, sql_connection):
        sql, conn = sql_connection  # Store a (sql, conn) tuple instead of a cursor
        results[index] = pd.read_sql(sql, conn)
    self._do_parallel(do_work)
    # DataFrame.append was removed in pandas 2.0; pd.concat stacks the per-connection frames
    return pd.concat(results)

Repo: https://github.com/godatadriven/ParallelConnection
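
If you would rather not patch the library, the same idea can be sketched with the standard library's concurrent.futures. This is a minimal sketch under stated assumptions, not the talk's implementation: read_sql_parallel and make_demo_db are hypothetical names, and temporary SQLite files stand in for the real databases so the example runs on its own. Each job is a (sql, connect) pair, where connect is a zero-argument callable, so every thread opens and closes its own connection.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def read_sql_parallel(jobs, max_workers=4):
    """Run each (sql, connect) job in its own thread and stack the frames."""
    def run(job):
        sql, connect = job
        conn = connect()  # connection is created and used in the same thread
        try:
            return pd.read_sql(sql, conn)
        finally:
            conn.close()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frames = list(pool.map(run, jobs))  # map keeps the order of `jobs`
    return pd.concat(frames, ignore_index=True)


def make_demo_db(path, rows):
    """Create a small SQLite file standing in for one database (demo only)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    conn.commit()
    conn.close()


with tempfile.TemporaryDirectory() as tmpdir:
    paths = [os.path.join(tmpdir, f"shard{i}.db") for i in range(2)]
    make_demo_db(paths[0], [(1, "a"), (2, "b")])
    make_demo_db(paths[1], [(3, "c")])

    query = "SELECT id, name FROM t ORDER BY id"
    # `p=p` binds each path at definition time.
    jobs = [(query, lambda p=p: sqlite3.connect(p)) for p in paths]
    df = read_sql_parallel(jobs)

print(df)
```

For SQL Server you would swap the lambda for whatever opens your connection. Recent pandas versions prefer SQLAlchemy connectables for databases other than SQLite, so passing an engine may be the cleaner choice there.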

4 Comments

Is it possible to show an example of how you actually pass a query? With sql, conn = sql_connection, do we basically need to pass a tuple of SQL and connection?
It's been a few years, so I don't fully remember the context - but it looks like from the linked code that you would pass in an array of (sql, conn) tuples to the constructor of ParallelConnection. Something like ParallelConnection([(sql1, con1), (sql2, con2)])
So there's no need to call execute() while passing a query string, just like in the question?
In the above example, I used fetchall instead of execute, but you can do the same thing with execute. After initializing the ParallelConnection with the array of tuples, call either execute or fetchall and the _do_parallel function handles passing out the work to the individual connections/queries.
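
To make the tuple-passing concrete, here is a self-contained sketch. ParallelReader is a hypothetical stand-in that only mirrors the shape of the modified fetchall from the answer; the real ParallelConnection in the repo may differ. In-memory SQLite connections, opened with check_same_thread=False so worker threads may use them, stand in for the real databases.

```python
import sqlite3
import threading

import pandas as pd


class ParallelReader:
    """Hypothetical stand-in for the modified ParallelConnection."""

    def __init__(self, sql_connections):
        # A list of (sql, conn) tuples, one per query to run.
        self.sql_connections = list(sql_connections)

    def _do_parallel(self, target):
        # One thread per (sql, conn) pair; wait for all of them to finish.
        threads = [
            threading.Thread(target=target, args=(index, pair))
            for index, pair in enumerate(self.sql_connections)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()

    def fetchall(self):
        results = [None] * len(self.sql_connections)

        def do_work(index, sql_connection):
            sql, conn = sql_connection
            results[index] = pd.read_sql(sql, conn)

        self._do_parallel(do_work)
        return pd.concat(results, ignore_index=True)


def make_conn(values):
    """Build an in-memory SQLite database holding `values` (demo only)."""
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
    conn.commit()
    return conn


con1, con2 = make_conn([1, 2]), make_conn([10])
reader = ParallelReader([
    ("SELECT x FROM t ORDER BY x", con1),
    ("SELECT x FROM t ORDER BY x", con2),
])
combined = reader.fetchall()
print(combined)
```

Each connection is used by exactly one thread, which is the pattern that works with drivers that allow sharing the module but not connections.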
