Python - writing raw SQL to manipulate dataframes in Python

Question

I am trying to perform SQL-language functions on Python Dataframes as if they were tables in Microsoft SQL Server. Seraching around, it appears that R has the module sqldf, and Python has some record of compatibility with pandasql - however I am unable to get Rodeo to work, if that is a requirement.

This Blog has records of the above. I am unable to import sqldf or pandasql by running any combination of

import pandasql as pdsql
from pandasql import sqldf
pysql = lambda q: pdsql.sqldf(q, globals())
which I scavenged from here and there.

In SAS, you are able to manipulate SAS datasets using PROC SQL as such:

PROC SQL;
    SELECT 
        b.patid,
        CASE WHEN ECD='1234' THEN 'ACTIVE' ELSE 'ACTIVE' END AS ACTIVE_INACTIVE,        
        b.SUMMARY_ID
    FROM SAStable1 a
    LEFT JOIN SAStable2 b
        ON a.patient_id=b.patid
       ;
QUIT;

This results in being able to type a SQL query on SAS datasets in SAS. This is different than pandas.read_sql_query(query, connection) function, which works great on running SQL queries on connected databases, but not in actual dataframes once they are in Python (unless I am missing something).

Is there anything like this for Python? Given that its available in SAS and R, I would be surprised, but my searches yield nothing actionable.

Thanks!

did you install pandasql? what errors did you get?

Ereli
– Ereli

2018-06-21 20:24:08 +00:00
Commented Jun 21, 2018 at 20:24 — Ereli
– Ereli, Commented Jun 21, 2018 at 20:24

Ereli · Accepted Answer · 2018-06-21 20:34:35Z

You don't need SQL to do this in pandas. you can JOIN two dataframes using:

df1.set_index('patient_id').join(df2.set_index('patid'))

you can create a column based on a condition, similar to CASE WHEN ECD='1234' THEN 'ACTIVE' ELSE 'ACTIVE' END AS ACTIVE_INACTIVE, by doing something like:

import pandas as pd
df1 = pd.DataFrame([[100,1234],[101,1234],[102,4356]], columns=['patient_id','ECD'])
df2 = pd.DataFrame([[100,345],[101,23423],[102,3423]], columns=['patid','SUMMARY_ID'])
df3 = df1.set_index('patient_id').join(df2.set_index('patid'))

def active(row):
     if row['ECD'] == 1234 :
         return 'ACTIVE'
     else:
         return 'INACTIVE'

df3['ACTIVE_INACTIVE'] = df3.apply(active, axis=1))

If you really need to use SQL, you can install pandasql using

sudo -H pip3 install pandasql

you can then use it like you would expect:

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
q = """SELECT
        *,
        CASE WHEN a.ECD='1234' THEN 'ACTIVE' ELSE 'INACTIVE' END AS ACTIVE_INACTIVE        
        FROM
        df1 a
      JOIN
        df2 b
           ON a.patient_id = b.patid;"""
print(pysqldf(q).head())

Parfait · Accepted Answer · 2018-06-22 15:59:28Z

2

As source codes show, both R's sqldf and Python's pandasql actually run in-memory SQLite instances (SQLite by default for R). So really, you can replicate the functionality with pandas' SQL methods (read_sql and to_sql) by interfacing with SQLAlchemy which pandasql in fact does under the hood! Specifically, consider the following with demo example:

Import all needed environment dataframes into an in-memory SQLite database:

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# IN-MEMORY DATABASE (NO PATH SPECIFIED)
engine = create_engine('sqlite://')

dates = pd.date_range('2018-01-01', '2018-06-22', freq='D')
df1 = pd.DataFrame({'current_date': np.random.choice(dates, 50),
                    'analysis_tool': 'pandas',                  
                    'num_value': np.random.randint(100, size=50)*1000                           
                   }, columns=['current_date', 'analysis_tool', 'num_value'])

df2 = pd.DataFrame({'current_date': np.random.choice(dates, 50),
                    'analysis_tool': 'r',                  
                    'num_value': np.random.randint(100, size=50)*1000                           
                   }, columns=['current_date', 'analysis_tool', 'num_value'])

df3 = pd.DataFrame({'current_date': np.random.choice(dates, 50),
                    'analysis_tool': 'sas',                  
                    'num_value': np.random.randint(100, size=50)*1000                           
                   }, columns=['current_date', 'analysis_tool', 'num_value'])

df1.to_sql("df1", con=engine, if_exists='replace')
df2.to_sql("df2", con=engine, if_exists='replace')
df3.to_sql("df3", con=engine, if_exists='replace')

Run your needed SQL queries to update and manipulate data:

# QUERIES RUN IN A TRANSACTION
with engine.begin() as cn:
   cn.execute("UPDATE df1 SET analysis_tool = 'python pandas'")
   cn.execute("INSERT INTO df3 (analytic_tool, current_date, num_value) VALUES (?, ?, ?)", 
              ('sas', '2018-06-23', 51000))

Import as a pandas dataframe:

strSQL = """SELECT * FROM df1
            UNION ALL
            SELECT * FROM df2
            UNION ALL
            SELECT * FROM df3;"""

df_all = pd.read_sql(strSQL, engine)

engine.dispose()                       # IN-MEMORY DATABASE DESTROYED

edited Jun 22, 2018 at 15:59

answered Jun 22, 2018 at 15:13

Parfait

108k19 gold badges102 silver badges138 bronze badges

5 Comments

Samsonite Manly Over a year ago

thanks - i'm too new to understand this, but im not able to run part 2 as is

Samsonite Manly Over a year ago

OperationalError: (sqlite3.OperationalError) near ")": syntax error [SQL: "UPDATE df1 SET analysis_tool = 'python pandas')"] (Background on this error at: sqlalche.me/e/e3q8)

Samsonite Manly Over a year ago

if i put the " outside the )) after 'python pandas' i get: cn.execute("INSERT INTO df3 (analytic_tool, current_date, num_value) VALUES (?, ?, ?)", ^ SyntaxError: invalid syntax.... with the "^" pointing to cn

Parfait Over a year ago

Whoops! There should not be any parenthesis inside the UPDATE statement raising the SQLite error. The second error is a Python syntax error due to extra parenthesis that should be removed.

Samsonite Manly Over a year ago

great! along with analytic_tool as analysis_tool. I'm not exactly sure how to use this. I was hoping for: for df1 and df2 which exist as dataframes in python as variables, do something like "select * from df1 where Patient_id not in df2". is that doable with this method?

Collectives™ on Stack Overflow

Python - writing raw SQL to manipulate dataframes in Python

2 Answers 2

Comments

5 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

5 Comments

Your Answer

Sign up or log in

Post as a guest

Related