Python Dataframe to SQL Query

Question

I have developed a Python script that reads a CSV file which is a result of a SQL query (just a select * from table) and I perform some transformations and calculations on that dataframe.

I get the dataframe using the following Python commands:

result=csv_df.sort_values(by=['column1','column2','column3'],ascending=True)
result=result.drop_duplicates(['column1','column2'])

Now I need to get the same table using a SQL Query. I have tried the following in T-SQL but I have not been succesful.

select * from data
    where column1 IN
    (select distinct column1,column2 from data)
 and 
    where column2 IN
    (select distinct column1,column2 from data)
    order by column1,column2;

I am new to SQL syntax, can someone help me with the query?

What I am trying to do is delete all the duplicated rows from the combination of column1 and column2.

In Python the reason I include column3 is because it has NULL values that I need to discard.

After this should I create a view to keep on performing calculations?

Parfait · Accepted Answer · 2017-01-04 01:47:53Z

2

Assuming a unique ID in table, consider taking the record with lowest ID of matching column1 and column2 pairs:

SELECT * FROM data AS main
WHERE main.ID IN
    (SELECT sub.MinID FROM
       (SELECT column1, column2, Min(ID) As MinID
        FROM data
        GROUP BY column1, column2) AS sub)
ORDER BY main.column1, main.column2;

Alternatively, with JOIN:

SELECT main.* FROM data AS main
INNER JOIN 
    (SELECT column1, column2, Min(ID) As MinID
     FROM data
     GROUP BY column1, column2) AS sub
ON main.ID = sub.MinID
ORDER BY main.column1, main.column2;

Even still, with EXISTS:

SELECT main.* FROM data AS main
WHERE EXISTS
  (SELECT 1 FROM 
      (SELECT column1, column2, Min(ID) As MinID
       FROM data
       GROUP BY column1, column2) sub
   WHERE main.ID = sub.MinID)
ORDER BY main.column1, main.column2;

And a non-Window function query using correlated count subquery (for potential compatibility with MySQL, SQLite, and MS Access). This version leaves out records if any of two columns are NULL:

SELECT * FROM 
  (SELECT *, 
      (SELECT Count(*) FROM data sub
       WHERE sub.ID <= data.ID
       AND sub.column1 = data.column1
       AND sub.column2 = data.column2) AS rn
   FROM data) AS main
WHERE main.rn = 1

answered Jan 4, 2017 at 1:47

Parfait

108k19 gold badges102 silver badges138 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Max Payne Over a year ago

This is what I was looking for. Can you please elaborate on the logic behind the first answer? Why do you use the Min(ID)? Thanks!

Parfait Over a year ago

Just like pandas, the drop_duplicates keeps first (default) and drops matches afterward. First here is the minimum ID. You can easily change to Max(ID).

Abhishek Anand · Accepted Answer · 2017-01-03 14:56:40Z

From what I understand you require all records ordered by column1, column2 and column3:

Select * from data order by column1,column2,column3

Now, on top of this you want to remove duplicate rows in columns participantObjectId and slipObjectId. First partition columns on the basis of participantObjectId and slipObjectId. The query below wraps on top of the above query and adds another field id which provides unique values for each row of the data.

select *, ROW_NUMBER() OVER (PARTITION BY participantObjectId,slipObjectId order by column1,column2,column3) as id
from (select * from data order by column1,column2,column3)

On top of this we add another select statement with an added condition to choose only those rows with id equal to 1.

select * from
(select *, ROW_NUMBER() OVER (PARTITION BY participantObjectId,slipObjectId order by column1,column2,column3) as id from 
(Select * from data order by column1,column2,column3
)) where id=1;

MaxU - stand with Ukraine · Accepted Answer · 2017-01-03 23:36:22Z

0

If I understood your question correctly, you can do it using ROW_NUMBER() function:

with VirtTab as (
    select
      t.*,
      row_number()
        over(partition by column1, column2 order by column1, column2) as rn
    from data t
)
select * from VirtTab
where rn = 1
order by column1, column2;

edited Jan 3, 2017 at 23:36

answered Jan 3, 2017 at 12:23

MaxU - stand with Ukraine

212k37 gold badges402 silver badges436 bronze badges

1 Comment

Max Payne Over a year ago

I'm getting Incorrect syntax near the keyword 'where'. Any idea why this can be?

Collectives™ on Stack Overflow

Python Dataframe to SQL Query

3 Answers 3

2 Comments

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related