
I'm trying to perform a SQL join on the contents of a dataframe with an external table I have in a Postgres database.

This is what the Dataframe looks like:

>>> df
   name  author  count
0  a     b       10
1  c     d       5
2  e     f       2

I need to join it with a Postgres table that looks like this:

TABLE: blog
title   author    url    
a       b         w.com
b       b         x.com
e       g         y.com

This is what I'm attempting to do, but this doesn't appear to be the right syntax for the query:

>>> sql_join = r"""select b.*, frame.*  from ({0}) frame
        join blog b
        on frame.name = b.title
        where frame.owner = b.owner 
        order by frame.count desc
        limit 30;""".format(df)

>>> res = pd.read_sql(sql_join, connection)

I'm not sure how I can use the values in the dataframes within the sql query. Can someone point me in the right direction? Thanks!

Edit: As per my use case, I'm not able to convert the blog table into a dataframe given memory and performance constraints.

4 Answers


I managed to do this without converting the dataframe to a temp table and without reading the blog table into a dataframe.

For anyone else facing the same issue, this is achieved using a virtual table of sorts.

This is what my final SQL query looks like:

>>> inner_string = "VALUES ('a','b',10), ('c','d',5), ('e','f',2)"

>>> sql_join = r"""SELECT * FROM blog
        JOIN ({0}) AS frame(title, owner, count)
        ON blog.title = frame.title
        WHERE blog.owner = frame.owner 
        ORDER BY frame.count DESC
        LIMIT 30;""".format(inner_string)

>>> res = pd.read_sql(sql_join, connection)

You can use string manipulation to convert all rows in the dataframe into one large string similar to inner_string.
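For reference, here is a minimal sketch of that string manipulation for the dataframe above. Note that plain string formatting performs no SQL escaping or quoting, so only use it with trusted data; otherwise build the VALUES list with parameter binding:

>>> rows = ", ".join(
        "('{0}', '{1}', {2})".format(name, author, count)
        for name, author, count in df.itertuples(index=False)
    )

>>> inner_string = "VALUES {0}".format(rows)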


1 Comment

This way helped me - WITH my_values(num, txt) AS ( VALUES (1, 'one'), (2, 'two'), (3, 'three') ) SELECT num, txt FROM my_values

You should create another dataframe from the Postgres table and then join both dataframes.

You can use read_sql to create a df from the table:

import psycopg2  ## Python connector library for Postgres
import pandas as pd

conn = psycopg2.connect(...)  ## Put your DB credentials here
blog_df = pd.read_sql('SELECT * FROM blog', con=conn)
## This brings the `blog` table's data into blog_df

It should look like this:

In [258]: blog_df
Out[258]: 
  title author    url
0     a      b  w.com
1     b      b  x.com
2     e      g  y.com

Now, you can join df and blog_df using merge like below:

In [261]: pd.merge(df, blog_df, left_on='name', right_on='title')
Out[261]: 
  name author_x  count title author_y    url
0    a        b     10     a        b  w.com
1    e        f      2     e        g  y.com

You will get a result like the one above, which you can clean up further (e.g. as sketched below).
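For instance, to mirror the SQL join in the question, which also matches on author, here is a sketch that merges on both keys and then applies the original ORDER BY/LIMIT (column names taken from the question):

res = pd.merge(df, blog_df, left_on=['name', 'author'],
               right_on=['title', 'author'])

# Top 30 rows by count, as in the original query
res = res.sort_values('count', ascending=False).head(30)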

Let me know if this helps.

6 Comments

Thanks for the response :) Merging dataframes is one way to go. Unfortunately, bringing the blog table in is what I'm trying to avoid. The table is quite large, so that isn't viable in my use case. Is there an alternative approach?
@GenAsis - Then do it the reverse way: push data frame to Postgres and run your JOIN there. Use to_sql via an SQLAlchemy connection.
@Parfait I considered that, and it appears that might be the only way to go. I'm performing this operation quite frequently, and ideally I don't want to be creating (temporary) tables for a join every time. Is this what you meant by pushing the dataframe to Postgres?
You could create the temp/staging table once, then clean/repopulate it and join to the final table as often as needed (see the sketch after this thread). This is the more efficient approach, since you can even index the join fields on the staging and final tables for Postgres's query engine!
Thanks @Parfait, I managed to find a clean solution (that I posted below) that works for me!
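For anyone following along, here is a minimal sketch of the staging-table approach discussed above, assuming a SQLAlchemy engine; the connection string and the df_staging table name are illustrative:

from sqlalchemy import create_engine
import pandas as pd

# Dataframe from the question
df = pd.DataFrame({'name': ['a', 'c', 'e'],
                   'author': ['b', 'd', 'f'],
                   'count': [10, 5, 2]})

engine = create_engine('postgresql+psycopg2://user:password@host:5432/dbname')

# Refresh the staging table's contents on each run
df.to_sql('df_staging', con=engine, if_exists='replace', index=False)

# The join runs inside Postgres against the large blog table
res = pd.read_sql("""
    SELECT b.*, s.count
    FROM blog b
    JOIN df_staging s
      ON b.title = s.name AND b.author = s.author
    ORDER BY s.count DESC
    LIMIT 30;
""", con=engine)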

I've had a similar problem. I found a workaround that lets me join data across two different servers where I only have read-only rights: use SQLAlchemy to insert the pandas dataframe into a temp table, then join against it.

import sqlalchemy as sa
import pandas as pd
from sqlalchemy.orm import Session

engine = sa.create_engine(...)  ## Put your connection string here
session = Session(engine)

metadata = sa.MetaData()
sql_of_df = sa.Table(
    "##df",  # a global temp table (SQL Server naming)
    metadata,
    sa.Column("name", sa.String(50), primary_key=True),  # pick appropriate lengths
    sa.Column("author", sa.String(50), nullable=False),
    sa.Column("count", sa.Integer),
)
metadata.create_all(engine)

# Insert the dataframe's rows via bound parameters
dataframe_dict = df.to_dict(orient='records')
insert_statement = sql_of_df.insert().values(
    {
        "name": sa.bindparam("name"),
        "author": sa.bindparam("author"),
        "count": sa.bindparam("count"),
    }
)
session.execute(insert_statement, dataframe_dict)

statement = sa.text("SELECT * FROM blog INNER JOIN ##df ON blog.title = ##df.name")
session.execute(statement)

Comments

0

I had a similar issue using the SQLAlchemy package and a SQLite database connection. The accepted answer did not work for me as-is; adapting it a bit, I was able to achieve the desired result:

# Import packages
import pandas as pd
# from sqlalchemy import create_engine

# Create a sample DataFrame
df = pd.DataFrame(data=[[1, 'A'], [2, 'B'], [3, 'C']])

# Turn the DataFrame's rows into one "(1, 'A'), (2, 'B'), (3, 'C')" string
data = str([tuple(x) for x in df.values]).replace('[', '').replace(']', '')

# Build a virtual table from the DataFrame (SQLite names the VALUES
# columns column1, column2, ...)
inner_table = """
SELECT [column1] AS column_a,
    CAST([column2] AS TEXT) AS column_b
FROM (VALUES {})
""".format(data)

# Query (replace my_table with your actual table name; `table` itself
# is a reserved word in SQLite)
query = """
SELECT my_table.*
FROM my_table
INNER JOIN ({}) AS inner_table
ON my_table.column_a = inner_table.column_a
""".format(inner_table)

# Read the result into a DataFrame
pd.read_sql_query(sql=query, con=connection)

Comments
