1

I am trying to perform SQL-language functions on Python Dataframes as if they were tables in Microsoft SQL Server. Seraching around, it appears that R has the module sqldf, and Python has some record of compatibility with pandasql - however I am unable to get Rodeo to work, if that is a requirement.

This Blog has records of the above. I am unable to import sqldf or pandasql by running any combination of

import pandasql as pdsql
from pandasql import sqldf
pysql = lambda q: pdsql.sqldf(q, globals())
which I scavenged from here and there.

In SAS, you are able to manipulate SAS datasets using PROC SQL as such:

PROC SQL;
    SELECT 
        b.patid,
        CASE WHEN ECD='1234' THEN 'ACTIVE' ELSE 'ACTIVE' END AS ACTIVE_INACTIVE,        
        b.SUMMARY_ID
    FROM SAStable1 a
    LEFT JOIN SAStable2 b
        ON a.patient_id=b.patid
       ;
QUIT;

This results in being able to type a SQL query on SAS datasets in SAS. This is different than pandas.read_sql_query(query, connection) function, which works great on running SQL queries on connected databases, but not in actual dataframes once they are in Python (unless I am missing something).

Is there anything like this for Python? Given that its available in SAS and R, I would be surprised, but my searches yield nothing actionable.

Thanks!

1
  • did you install pandasql? what errors did you get? Commented Jun 21, 2018 at 20:24

2 Answers 2

2

You don't need SQL to do this in pandas. you can JOIN two dataframes using:

df1.set_index('patient_id').join(df2.set_index('patid'))

you can create a column based on a condition, similar to CASE WHEN ECD='1234' THEN 'ACTIVE' ELSE 'ACTIVE' END AS ACTIVE_INACTIVE, by doing something like:

import pandas as pd
df1 = pd.DataFrame([[100,1234],[101,1234],[102,4356]], columns=['patient_id','ECD'])
df2 = pd.DataFrame([[100,345],[101,23423],[102,3423]], columns=['patid','SUMMARY_ID'])
df3 = df1.set_index('patient_id').join(df2.set_index('patid'))

def active(row):
     if row['ECD'] == 1234 :
         return 'ACTIVE'
     else:
         return 'INACTIVE'

df3['ACTIVE_INACTIVE'] = df3.apply(active, axis=1))

If you really need to use SQL, you can install pandasql using

sudo -H pip3 install pandasql

you can then use it like you would expect:

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
q = """SELECT
        *,
        CASE WHEN a.ECD='1234' THEN 'ACTIVE' ELSE 'INACTIVE' END AS ACTIVE_INACTIVE        
        FROM
        df1 a
      JOIN
        df2 b
           ON a.patient_id = b.patid;"""
print(pysqldf(q).head())
Sign up to request clarification or add additional context in comments.

Comments

2

As source codes show, both R's sqldf and Python's pandasql actually run in-memory SQLite instances (SQLite by default for R). So really, you can replicate the functionality with pandas' SQL methods (read_sql and to_sql) by interfacing with SQLAlchemy which pandasql in fact does under the hood! Specifically, consider the following with demo example:

  1. Import all needed environment dataframes into an in-memory SQLite database:

    import numpy as np
    import pandas as pd
    from sqlalchemy import create_engine
    
    # IN-MEMORY DATABASE (NO PATH SPECIFIED)
    engine = create_engine('sqlite://')
    
    dates = pd.date_range('2018-01-01', '2018-06-22', freq='D')
    df1 = pd.DataFrame({'current_date': np.random.choice(dates, 50),
                        'analysis_tool': 'pandas',                  
                        'num_value': np.random.randint(100, size=50)*1000                           
                       }, columns=['current_date', 'analysis_tool', 'num_value'])
    
    df2 = pd.DataFrame({'current_date': np.random.choice(dates, 50),
                        'analysis_tool': 'r',                  
                        'num_value': np.random.randint(100, size=50)*1000                           
                       }, columns=['current_date', 'analysis_tool', 'num_value'])
    
    df3 = pd.DataFrame({'current_date': np.random.choice(dates, 50),
                        'analysis_tool': 'sas',                  
                        'num_value': np.random.randint(100, size=50)*1000                           
                       }, columns=['current_date', 'analysis_tool', 'num_value'])
    
    df1.to_sql("df1", con=engine, if_exists='replace')
    df2.to_sql("df2", con=engine, if_exists='replace')
    df3.to_sql("df3", con=engine, if_exists='replace')
    
  2. Run your needed SQL queries to update and manipulate data:

    # QUERIES RUN IN A TRANSACTION
    with engine.begin() as cn:
       cn.execute("UPDATE df1 SET analysis_tool = 'python pandas'")
       cn.execute("INSERT INTO df3 (analytic_tool, current_date, num_value) VALUES (?, ?, ?)", 
                  ('sas', '2018-06-23', 51000))
    
  3. Import as a pandas dataframe:

    strSQL = """SELECT * FROM df1
                UNION ALL
                SELECT * FROM df2
                UNION ALL
                SELECT * FROM df3;"""
    
    df_all = pd.read_sql(strSQL, engine)
    
    engine.dispose()                       # IN-MEMORY DATABASE DESTROYED
    

5 Comments

thanks - i'm too new to understand this, but im not able to run part 2 as is
OperationalError: (sqlite3.OperationalError) near ")": syntax error [SQL: "UPDATE df1 SET analysis_tool = 'python pandas')"] (Background on this error at: sqlalche.me/e/e3q8)
if i put the " outside the )) after 'python pandas' i get: cn.execute("INSERT INTO df3 (analytic_tool, current_date, num_value) VALUES (?, ?, ?)", ^ SyntaxError: invalid syntax.... with the "^" pointing to cn
Whoops! There should not be any parenthesis inside the UPDATE statement raising the SQLite error. The second error is a Python syntax error due to extra parenthesis that should be removed.
great! along with analytic_tool as analysis_tool. I'm not exactly sure how to use this. I was hoping for: for df1 and df2 which exist as dataframes in python as variables, do something like "select * from df1 where Patient_id not in df2". is that doable with this method?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.