Using pandas in Python, I need to generate efficient queries from a DataFrame against PostgreSQL. Unfortunately, DataFrame.to_sql(...) only performs direct inserts, and the query I want to run is fairly complicated.

Ideally, I'd like to do this:

WITH my_data AS (
  SELECT * FROM (
    VALUES 
    <dataframe data>
  ) AS data (col1, col2, col3)
)
UPDATE my_table 
SET
col1 = my_data.col1,
col2 = complex_function(my_table.col2, my_data.col2)
FROM my_data
WHERE my_table.col3 < my_data.col3;

However, to do that, I would need to turn my DataFrame into a plain VALUES statement. I could, of course, write my own functions, but past experience has taught me that functions to escape and sanitize SQL should never be written by hand.

We are using SQLAlchemy, but bound parameters seem to work only with a limited number of arguments, and ideally I would like the serialization of the DataFrame into text to be done at C speed.

So, is there a way, through either pandas or SQLAlchemy, to efficiently turn my DataFrame into the VALUES sub-statement and insert it into my query?

  • I have a similar query, which I save as a proc and call from pandas with pd.read_sql_query('EXEC proc_name'). Let me know if I misunderstood the question. Commented Jun 3, 2019 at 15:49

1 Answer

You could use psycopg2.extras.execute_values. For example, given this setup:

CREATE TABLE my_table (
col1 int
, col2 text
, col3 int
);
INSERT INTO my_table VALUES 
(99, 'X', 1)
, (99, 'Y', 2)
, (99, 'Z', 99);

# | col1 | col2 | col3 |
# |------+------+------|
# |   99 | X    |    1 |
# |   99 | Y    |    2 |
# |   99 | Z    |   99 |

the Python code

import psycopg2
import psycopg2.extras as pge
import pandas as pd
import config

df = pd.DataFrame([
    (1, 'A', 10), 
    (2, 'B', 20),
    (3, 'C', 30)])

with psycopg2.connect(host=config.HOST, user=config.USER, password=config.PASS, database=config.USER) as conn:
    with conn.cursor() as cursor:
        sql = '''WITH my_data AS (
          SELECT * FROM (
            VALUES %s
          ) AS data (col1, col2, col3)
        )
        UPDATE my_table 
        SET
        col1 = my_data.col1,
        -- col2 = complex_function(col2, my_data.col2)
        col2 = my_table.col2 || my_data.col2
        FROM my_data
        WHERE my_table.col3 < my_data.col3'''

        pge.execute_values(cursor, sql, df.values)

updates my_table to be

# SELECT * FROM my_table
| col1 | col2 | col3 |
|------+------+------|
|   99 | Z    |   99 |
|    1 | XA   |    1 |
|    1 | YA   |    2 |

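Alternatively, you could use psycopg2 to generate the SQL without executing it. The code in format_values is almost entirely copied from the source code for pge.execute_values. Note that it relies on private helpers (pge._ext, pge._split_sql, pge._paginate), which may change between psycopg2 versions.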

import psycopg2
import psycopg2.extras as pge
import pandas as pd
import config

df = pd.DataFrame([
    (1, "A'foo'", 10), 
    (2, 'B', 20),
    (3, 'C', 30)])


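def format_values(cur, sql, argslist, template=None, page_size=100):
    # Work in bytes using the connection's encoding, as execute_values does.
    enc = pge._ext.encodings[cur.connection.encoding]
    if not isinstance(sql, bytes):
        sql = sql.encode(enc)
    # Split the statement around its single %s placeholder.
    pre, post = pge._split_sql(sql)
    result = []
    for page in pge._paginate(argslist, page_size=page_size):
        if template is None:
            # Default template: one %s per column, e.g. (%s,%s,%s).
            template = b'(' + b','.join([b'%s'] * len(page[0])) + b')'
        parts = pre[:]
        for args in page:
            # mogrify quotes and escapes each row through psycopg2.
            parts.append(cur.mogrify(template, args))
            parts.append(b',')
        parts[-1:] = post  # replace the trailing comma with the rest of the SQL
        result.append(b''.join(parts))
    # Each page is a complete statement; separate them so that more than
    # page_size rows still produce valid SQL.
    return b';\n'.join(result).decode(enc)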

with psycopg2.connect(host=config.HOST, user=config.USER, password=config.PASS, database=config.USER) as conn:
    with conn.cursor() as cursor:
        sql = '''WITH my_data AS (
          SELECT * FROM (
            VALUES %s
          ) AS data (col1, col2, col3)
        )
        UPDATE my_table 
        SET
        col1 = my_data.col1,
        -- col2 = complex_function(col2, my_data.col2)
        col2 = my_table.col2 || my_data.col2
        FROM my_data
        WHERE my_table.col3 < my_data.col3'''

        print(format_values(cursor, sql, df.values))

yields

WITH my_data AS (
          SELECT * FROM (
            VALUES (1,'A''foo''',10),(2,'B',20),(3,'C',30)
          ) AS data (col1, col2, col3)
        )
        UPDATE my_table 
        SET
        col1 = my_data.col1,
        -- col2 = complex_function(col2, my_data.col2)
        col2 = my_table.col2 || my_data.col2
        FROM my_data
        WHERE my_table.col3 < my_data.col3
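
Both snippets pass df.values straight to psycopg2. That works for the mixed-type frame above because the resulting object array holds plain Python values. With an all-integer frame, df.values would hold numpy.int64 scalars, which psycopg2 does not adapt out of the box. Below is a minimal sketch of one way to sidestep this; dataframe_to_rows is a hypothetical helper name, not part of pandas or psycopg2.

```python
import pandas as pd


def dataframe_to_rows(df):
    """Return the frame's rows as a list of plain tuples.

    Hypothetical helper: itertuples(index=False, name=None) yields
    ordinary tuples, and iterating pandas columns yields Python
    scalars (int, float, str) rather than NumPy scalars.
    """
    return list(df.itertuples(index=False, name=None))


df = pd.DataFrame([
    (1, 'A', 10),
    (2, 'B', 20),
    (3, 'C', 30)])

rows = dataframe_to_rows(df)
print(rows)  # [(1, 'A', 10), (2, 'B', 20), (3, 'C', 30)]
```

The resulting list can be passed as the argslist argument in place of df.values, for example pge.execute_values(cursor, sql, rows).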

3 Comments

This could be a decent workaround, but it would be quite annoying, since I'm working within a SQLAlchemy transaction in which I'm doing regular (but unrelated) ORM operations.
I've added some code to show how you could use psycopg2 to generate the SQL without executing it. You could then use SQLAlchemy to execute the SQL.
Thanks. I have also looked into this on my side, and it seems I can access my current transaction's connection with session.connection().connection. If so, this would basically solve my problem. I'll leave this open a couple more hours to see if a more pure pandas/SQLAlchemy answer appears, but it seems your solution fits.
