I am trying to write a query in Python using pandasql. My code is below:

import pandas as pd
from pandasql import *

data = pd.read_csv('registerlog.csv')

q = """
SELECT
    a.RegistrationMonth, COUNT(DISTINCT a.UserID) AS UserSize,
    COUNT(
        CASE a.MonthDifference
            WHEN 0.0 THEN DISTINCT a.UserID ELSE NULL
        END
    ) AS MonthZero
FROM
    data AS a
GROUP BY
    a.RegistrationMonth
"""

print sqldf(q, locals())

But this gives the following error:

    print sqldf(q, locals())
  File "C:\Python27\lib\site-packages\pandasql\sqldf.py", line 156, in sqldf
    return PandaSQL(db_uri)(query, env)
  File "C:\Python27\lib\site-packages\pandasql\sqldf.py", line 63, in __call__
    raise PandaSQLException(ex)
PandaSQLException: (sqlite3.OperationalError) near "DISTINCT": syntax error

If I use WHEN 0.0 THEN a.UserID ELSE NULL instead, the query works. A plain COUNT(DISTINCT a.UserID) also works fine.

However, I want to count only the DISTINCT values selected by the CASE. Is there a way to get a DISTINCT count of the values the CASE returns?

  • Did you try COUNT(DISTINCT (CASE ... END)) AS MonthZero? Commented Jan 23, 2018 at 16:41
  • Did some searching and it appears that using DISTINCT inside a CASE statement is problematic... stackoverflow.com/questions/25687345/… Commented Jan 23, 2018 at 16:51

1 Answer

In the SQL grammar, DISTINCT is not part of a value expression; it belongs to the SELECT clause or to an aggregate function (here, COUNT). It must therefore be written directly after SELECT or directly after the aggregate's opening parenthesis, with the whole CASE expression as its argument:

SELECT ..., COUNT(DISTINCT CASE ... END) ...
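As a rough illustration, here is the full query from the question rewritten in that form. This is only a sketch: it runs against an in-memory SQLite database through Python's standard sqlite3 module (pandasql reported a sqlite3 error above, so the same SQL dialect applies), and the sample rows are made up for the example, not taken from registerlog.csv.

```python
import sqlite3

# Hypothetical sample data standing in for registerlog.csv.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (UserID INTEGER, RegistrationMonth TEXT, MonthDifference REAL)"
)
rows = [
    (1, "2017-01", 0.0),
    (1, "2017-01", 0.0),  # same user again in month zero, must count once
    (1, "2017-01", 1.0),
    (2, "2017-01", 1.0),  # never active in month zero
    (3, "2017-02", 0.0),
]
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

# DISTINCT sits right after COUNT( and applies to the whole CASE expression.
# Rows where the CASE yields NULL are ignored by COUNT.
q = """
SELECT
    a.RegistrationMonth,
    COUNT(DISTINCT a.UserID) AS UserSize,
    COUNT(DISTINCT
        CASE a.MonthDifference
            WHEN 0.0 THEN a.UserID ELSE NULL
        END
    ) AS MonthZero
FROM
    data AS a
GROUP BY
    a.RegistrationMonth
ORDER BY
    a.RegistrationMonth
"""

result = conn.execute(q).fetchall()
for row in result:
    print(row)
```

With pandasql, the same query string should be usable unchanged via sqldf(q, locals()), provided data is the DataFrame loaded from the CSV.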