
I'm trying to perform a SQL join on the contents of a dataframe with an external table I have in a Postgres database.

This is what the Dataframe looks like:

>>> df
   name  author  count
0  a     b       10
1  c     d       5
2  e     f       2

I need to join it with a Postgres table that looks like this:

TABLE: blog
title   author    url    
a       b         w.com
b       b         x.com
e       g         y.com

This is what I'm attempting to do, but this doesn't appear to be the right syntax for the query:

>>> sql_join = r"""select b.*, frame.*  from ({0}) frame
        join blog b
        on frame.name = b.title
        where frame.owner = b.owner 
        order by frame.count desc
        limit 30;""".format(df)

>>> res = pd.read_sql(sql_join, connection)

I'm not sure how I can use the values in the dataframes within the sql query. Can someone point me in the right direction? Thanks!

Edit: As per my use case, I'm not able to convert the blog table into a dataframe given memory and performance constraints.

4 Answers


I managed to do this without converting the dataframe to a temp table and without reading the blog table into a dataframe.

For anyone else facing the same issue, this is achieved using a virtual table of sorts.

This is what my final SQL query looks like:

>>> inner_string = "VALUES ('a','b',10), ('c','d',5), ('e','f',2)"

>>> sql_join = r"""SELECT * FROM blog
        JOIN ({0}) AS frame(title, owner, count)
        ON blog.title = frame.title
        WHERE blog.owner = frame.owner 
        ORDER BY frame.count DESC
        LIMIT 30;""".format(inner_string)

>>> res = pd.read_sql(sql_join, connection)

You can use string manipulation to convert all rows in the dataframe into one large string similar to inner_string.
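For reference, here is a minimal sketch of that string manipulation for the dataframe above. Note that plain string formatting performs no SQL escaping or quoting, so only use it with trusted data; otherwise build the VALUES list with parameter binding:

>>> rows = ", ".join(
        "('{0}', '{1}', {2})".format(name, author, count)
        for name, author, count in df.itertuples(index=False)
    )

>>> inner_string = "VALUES {0}".format(rows)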


1 Comment

This way helped me - WITH my_values(num, txt) AS ( VALUES (1, 'one'), (2, 'two'), (3, 'three') ) SELECT num, txt FROM my_values

You should create another dataframe from the Postgres table and then join both dataframes.

You can use read_sql to create a df from the table:

import psycopg2  ## Python connector library for Postgres
import pandas as pd

conn = psycopg2.connect(...)  ## Put your DB credentials here
blog_df = pd.read_sql('SELECT * FROM blog', con=conn)
## This brings the `blog` table's data into blog_df

It should look like this:

In [258]: blog_df
Out[258]: 
  title author    url
0     a      b  w.com
1     b      b  x.com
2     e      g  y.com

Now, you can join df and blog_df using merge like below:

In [261]: pd.merge(df, blog_df, left_on='name', right_on='title')
Out[261]: 
  name author_x  count title author_y    url
0    a        b     10     a        b  w.com
1    e        f      2     e        g  y.com

You will get a result like the one above, which you can clean up further (e.g. as sketched below).
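For instance, to mirror the SQL join in the question, which also matches on author, here is a sketch that merges on both keys and then applies the original ORDER BY/LIMIT (column names taken from the question):

res = pd.merge(df, blog_df, left_on=['name', 'author'],
               right_on=['title', 'author'])

# Top 30 rows by count, as in the original query
res = res.sort_values('count', ascending=False).head(30)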

Let me know if this helps.

6 Comments

Thanks for the response :) Merging dataframes is one way to go. Unfortunately, bringing the blog table in is what I'm trying to avoid. The table is quite large, so that isn't viable in my use case. Is there an alternative approach?
@GenAsis - Then do it the reverse way: push data frame to Postgres and run your JOIN there. Use to_sql via an SQLAlchemy connection.
@Parfait I considered that, and it appears that might be the only way to go. I'm performing this operation quite frequently, and ideally I don't want to be creating (temporary) tables for a join every time. Is this what you meant by pushing the dataframe to Postgres?
You could create the temp/staging table once, then clean/repopulate it and join to the final table as often as needed (see the sketch after this thread). This is the more efficient approach, since you can even index the join fields on the staging and final tables for Postgres's query engine!
Thanks @Parfait, I managed to find a clean solution (that I posted below) that works for me!
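For anyone following along, here is a minimal sketch of the staging-table approach discussed above, assuming a SQLAlchemy engine; the connection string and the df_staging table name are illustrative:

from sqlalchemy import create_engine
import pandas as pd

# Dataframe from the question
df = pd.DataFrame({'name': ['a', 'c', 'e'],
                   'author': ['b', 'd', 'f'],
                   'count': [10, 5, 2]})

engine = create_engine('postgresql+psycopg2://user:password@host:5432/dbname')

# Refresh the staging table's contents on each run
df.to_sql('df_staging', con=engine, if_exists='replace', index=False)

# The join runs inside Postgres against the large blog table
res = pd.read_sql("""
    SELECT b.*, s.count
    FROM blog b
    JOIN df_staging s
      ON b.title = s.name AND b.author = s.author
    ORDER BY s.count DESC
    LIMIT 30;
""", con=engine)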

I've had a similar problem. I found a workaround that lets me join data across two different servers where I only have read-only rights: use SQLAlchemy to insert the pandas dataframe into a temp table, then join against it.

import sqlalchemy as sa
import pandas as pd
from sqlalchemy.orm import Session

engine = sa.create_engine(...)  ## Put your connection string here
session = Session(engine)

metadata = sa.MetaData()
sql_of_df = sa.Table(
    "##df",  # a global temp table (SQL Server naming)
    metadata,
    sa.Column("name", sa.String(50), primary_key=True),  # pick appropriate lengths
    sa.Column("author", sa.String(50), nullable=False),
    sa.Column("count", sa.Integer),
)
metadata.create_all(engine)

# Insert the dataframe's rows via bound parameters
dataframe_dict = df.to_dict(orient='records')
insert_statement = sql_of_df.insert().values(
    {
        "name": sa.bindparam("name"),
        "author": sa.bindparam("author"),
        "count": sa.bindparam("count"),
    }
)
session.execute(insert_statement, dataframe_dict)

statement = sa.text("SELECT * FROM blog INNER JOIN ##df ON blog.title = ##df.name")
session.execute(statement)

Comments

0

I had a similar issue using the SQLAlchemy package and a SQLite database connection. The accepted answer did not work for me as-is; adapting it a bit, I was able to achieve the desired result:

# Import packages
import pandas as pd
# from sqlalchemy import create_engine

# Create a sample DataFrame
df = pd.DataFrame(data=[[1, 'A'], [2, 'B'], [3, 'C']])

# Turn the DataFrame's rows into one "(1, 'A'), (2, 'B'), (3, 'C')" string
data = str([tuple(x) for x in df.values]).replace('[', '').replace(']', '')

# Build a virtual table from the DataFrame (SQLite names the VALUES
# columns column1, column2, ...)
inner_table = """
SELECT [column1] AS column_a,
    CAST([column2] AS TEXT) AS column_b
FROM (VALUES {})
""".format(data)

# Query (replace my_table with your actual table name; `table` itself
# is a reserved word in SQLite)
query = """
SELECT my_table.*
FROM my_table
INNER JOIN ({}) AS inner_table
ON my_table.column_a = inner_table.column_a
""".format(inner_table)

# Read the result into a DataFrame
pd.read_sql_query(sql=query, con=connection)

Comments
