Transforming a pandas DataFrame into a VALUES SQL statement

Question

Using pandas in Python, I need to be able to generate efficient queries from a DataFrame into PostgreSQL. Unfortunately, DataFrame.to_sql(...) only performs direct inserts, and the query I wish to make is fairly complicated.

Ideally, I'd like to do this:

WITH my_data AS (
  SELECT * FROM (
    VALUES 
    <dataframe data>
  ) AS data (col1, col2, col3)
)
UPDATE my_table 
SET
col1 = my_data.col1,
col2 = complex_function(my_table.col2, my_data.col2)
FROM my_data
WHERE my_table.col3 < my_data.col3;

However, to do that, I would need to turn my DataFrame into a plain VALUES statement. I could, of course, write my own functions, but past experience has taught me that functions to escape and sanitize SQL should never be written by hand.

We are using SQLAlchemy, but bound parameters seem to work only with a limited number of arguments, and ideally I would like the DataFrame to be serialized into text at C speed.

So, is there a way, through either pandas or SQLAlchemy, to efficiently turn my DataFrame into the VALUES sub-statement and insert it into my query?

I have a similar query that I save as a stored procedure and call from pandas with pd.read_sql_query('EXEC proc_name'). Let me know if I misunderstood the question.
– anky, Commented Jun 3, 2019 at 15:49

unutbu · Accepted Answer · 2019-06-03 17:03:34Z

4

You could use psycopg2.extras.execute_values. For example, given this setup

CREATE TABLE my_table (
col1 int
, col2 text
, col3 int
);
INSERT INTO my_table VALUES 
(99, 'X', 1)
, (99, 'Y', 2)
, (99, 'Z', 99);

# | col1 | col2 | col3 |
# |------+------+------|
# |   99 | X    |    1 |
# |   99 | Y    |    2 |
# |   99 | Z    |   99 |

The python code

import psycopg2
import psycopg2.extras as pge
import pandas as pd
import config

df = pd.DataFrame([
    (1, 'A', 10), 
    (2, 'B', 20),
    (3, 'C', 30)])

with psycopg2.connect(host=config.HOST, user=config.USER, password=config.PASS, database=config.USER) as conn:
    with conn.cursor() as cursor:
        sql = '''WITH my_data AS (
          SELECT * FROM (
            VALUES %s
          ) AS data (col1, col2, col3)
        )
        UPDATE my_table 
        SET
        col1 = my_data.col1,
        -- col2 = complex_function(col2, my_data.col2)
        col2 = my_table.col2 || my_data.col2
        FROM my_data
        WHERE my_table.col3 < my_data.col3'''

        pge.execute_values(cursor, sql, df.values)

updates my_table to be

# SELECT * FROM my_table
| col1 | col2 | col3 |
|------+------+------|
|   99 | Z    |   99 |
|    1 | XA   |    1 |
|    1 | YA   |    2 |
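One caveat, offered as an assumption about typical data rather than as part of the answer above: df.values contains plain Python objects here only because the frame mixes integers and strings, so NumPy falls back to an object array. For a frame whose columns all share a numeric dtype, the rows would hold NumPy scalars such as numpy.int64, which psycopg2 does not adapt by default, and missing values would arrive as NaN rather than NULL. A minimal sketch of a conversion step follows; dataframe_to_rows is a hypothetical helper name, not a pandas or psycopg2 function.

```python
import pandas as pd


def dataframe_to_rows(df):
    """Convert a DataFrame into a list of tuples of plain Python objects.

    Hypothetical helper, not part of pandas or psycopg2.
    """
    rows = []
    # astype(object) turns NumPy scalars (e.g. numpy.int64) into Python ints/floats.
    for row in df.astype(object).itertuples(index=False, name=None):
        # Map missing values (NaN, None, NaT) to None so they are sent as NULL.
        rows.append(tuple(None if pd.isna(value) else value for value in row))
    return rows


# Small illustrative frame (made-up data) with a missing value in two columns.
df = pd.DataFrame({"col1": [1, 2], "col2": ["A", None], "col3": [10.5, float("nan")]})
rows = dataframe_to_rows(df)
```

The resulting list could then be passed as pge.execute_values(cursor, sql, rows) in place of df.values.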

Alternatively, you could use psycopg2 to generate the SQL. The code in format_values is almost entirely copied from the source code for pge.execute_values.

import psycopg2
import psycopg2.extras as pge
import pandas as pd
import config

df = pd.DataFrame([
    (1, "A'foo'", 10), 
    (2, 'B', 20),
    (3, 'C', 30)])


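def format_values(cur, sql, argslist, template=None, page_size=100):
    """Render sql with its single %s placeholder expanded into VALUES rows.

    Mirrors psycopg2.extras.execute_values but returns the SQL text instead
    of executing it. Relies on private psycopg2 helpers (_ext, _split_sql,
    _paginate), which may change between releases.
    """
    enc = pge._ext.encodings[cur.connection.encoding]
    if not isinstance(sql, bytes):
        sql = sql.encode(enc)
    # Split the query into the parts before and after the %s placeholder.
    pre, post = pge._split_sql(sql)
    result = []
    for page in pge._paginate(argslist, page_size=page_size):
        if template is None:
            # Default row template, e.g. b'(%s,%s,%s)' for three columns.
            template = b'(' + b','.join([b'%s'] * len(page[0])) + b')'
        parts = pre[:]
        for args in page:
            # mogrify quotes and escapes each row the way execute() would.
            parts.append(cur.mogrify(template, args))
            parts.append(b',')
        # Replace the trailing comma with the remainder of the query.
        parts[-1:] = post
        result.append(b''.join(parts))
    # Each page is a complete statement, so separate pages with ';'.
    return b';\n'.join(result).decode(enc)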

with psycopg2.connect(host=config.HOST, user=config.USER, password=config.PASS, database=config.USER) as conn:
    with conn.cursor() as cursor:
        sql = '''WITH my_data AS (
          SELECT * FROM (
            VALUES %s
          ) AS data (col1, col2, col3)
        )
        UPDATE my_table 
        SET
        col1 = my_data.col1,
        -- col2 = complex_function(col2, my_data.col2)
        col2 = my_table.col2 || my_data.col2
        FROM my_data
        WHERE my_table.col3 < my_data.col3'''

        print(format_values(cursor, sql, df.values))

yields

WITH my_data AS (
          SELECT * FROM (
            VALUES (1,'A''foo''',10),(2,'B',20),(3,'C',30)
          ) AS data (col1, col2, col3)
        )
        UPDATE my_table 
        SET
        col1 = my_data.col1,
        -- col2 = complex_function(col2, my_data.col2)
        col2 = my_table.col2 || my_data.col2
        FROM my_data
        WHERE my_table.col3 < my_data.col3

edited Jun 3, 2019 at 17:03

answered Jun 3, 2019 at 16:46

unutbu


3 Comments

Rémi Bonnet Over a year ago

This could be a decent workaround, but it would be quite annoying, since I'm working within a SQLAlchemy transaction in which I'm doing regular (but unrelated) ORM operations.

unutbu Over a year ago

I've added some code to show how you could use psycopg2 to generate the SQL without executing it. You could then use SQLAlchemy to execute the SQL.

Rémi Bonnet Over a year ago

Thanks. I have also looked into this on my side, and it seems I can access my current transaction's connection via session.connection().connection. If that is the case, it would basically solve my problem. I'll leave this open a couple more hours to see if a more purely pandas/SQLAlchemy answer appears, but it seems your solution fits.
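A minimal sketch of the idea in the last comment, assuming session is a SQLAlchemy Session whose engine uses psycopg2. The helper name session_cursor is hypothetical. Session.connection() returns the SQLAlchemy Connection for the current transaction, and its .connection attribute exposes the underlying DBAPI connection.

```python
from contextlib import contextmanager


@contextmanager
def session_cursor(session):
    """Yield a DBAPI cursor that shares the session's current transaction.

    Hypothetical helper; `session` is assumed to be a SQLAlchemy Session.
    """
    # Session.connection() -> SQLAlchemy Connection; .connection -> DBAPI connection.
    dbapi_conn = session.connection().connection
    cursor = dbapi_conn.cursor()
    try:
        yield cursor
    finally:
        # Close only the cursor; committing or rolling back stays with the session.
        cursor.close()
```

With this in place, something like `with session_cursor(session) as cur: pge.execute_values(cur, sql, rows)` should run the update inside the same transaction as the surrounding ORM operations, leaving the commit to the session.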
