Use a condition in a WHERE subquery in postgresql

Question

I would like some help on this situation. I have a table with UUID (unique), email (repeated), timestamp (unique) and has_sales (can be 1 if yes and 0 if no)

Sample data

uuid       email       timestamp        has_sales
    1   [email protected] 2016-10-02 10:28:23    0
    2   [email protected] 2017-10-03 10:28:23    0
    3   [email protected] 2017-10-06 17:08:15    1
    4   [email protected] 2017-12-04 20:47:17    0
    5   [email protected] 2018-05-21 15:27:04    0
    6   [email protected] 2016-10-02 10:28:23    1
    7   [email protected] 2017-10-03 10:28:23    0

I would like to choose the oldest timestamp, except when there is a sale on a newer one (it is rare, but it may occur). So, the expected result would be

  uuid     email       timestamp        has_sales
    3   [email protected] 2017-10-06 17:08:15    1
    6   [email protected] 2016-10-02 10:28:23    1

Currently, I am just using the first condition (oldest timestamp), as follows:

SELECT
    dm1.uuid,
    dm1.email,
    dm1.timestamp,
    dm1.has_sales
FROM dup_mail dm1
where
    time_stamp = (select min(time_stamp)
                       from dup_mail dm2
                       where dm1.email = dm2.email
                       )
order by 2

How, upgrading this code, could I add the condition of if there is sale to a newer user and no sale to the older one, I would choose the newer one? Each email is related to either no sales (0 in all duplicate accounts) or yes sale (1 in one of the duplicate accounts and 0 in the other ones). Even if there is more than one duplicate account with sales, I just want to know if there was a sale or not

Yes, each email has either 0 sales or 1 sale in one of the duplicate accounts. I don't believe there are cases with sales in two or more duplicate accounts based on the same email, but if it does, I just care if the email had a sale or not — Flavio de Falcao e Helena
– Flavio de Falcao e Helena, Commented May 23, 2018 at 20:42

spencer7593 · Accepted Answer · 2018-05-23 21:15:51Z

2

The correlated subquery could be rewritten

  SELECT dm2.timestamp
    FROM dup_mail dm2
   WHERE dm2.email = dm1.email 
   ORDER
      BY dm2.has_sales DESC
       , dm2.timestamp ASC 
   LIMIT 1

This is going to sort the rows with has_sales=1 before rows with has_sales=0, and then by timestamp. The LIMIT 1 clause picks the first row (after the set is sorted.)

We are going to want a suitable index on dup_mail table with email as the leading column. Including timestamp and has_sales columns in the index would make it a covering index for the subquery.

This should satisfy the specification, but the correlated subquery may not be optimal in terms of performance.

SELECT dm1.uuid
     , dm1.email
     , dm1.timestamp
     , dm1.has_sales
  FROM dup_mail dm1 
 WHERE dm1.timestamp = 
       ( SELECT dm2.timestamp
           FROM dup_mail dm2
          WHERE dm2.email = dm1.email 
          ORDER
             BY dm2.has_sales DESC
              , dm2.timestamp ASC 
          LIMIT 1
      )
ORDER 
   BY ...

(It's a bit odd that timestamp would be unique across all of the rows; but if it is, then this query could work.)

We might get better performance with something like this:

SELECT dmx.email 
     , IF( MAX(dmx.has_sales)=0
         , MIN(dmx.timestamp)
         , MIN(IF(dmx.has_sales=1,dmx.timestamp,NULL))
       ) AS min_timestamp 
  FROM dup_email dmx
 GROUP BY dmx.email

And then using that as an inline view and join to the dup_mail table to get the rows associated with the minimum timestamp

SELECT dm1.uuid
     , dm1.email
     , dm1.timestamp
     , dm1.has_sales
  FROM ( -- minimum timestamp for each email
         SELECT dmx.email
              , IF( MAX(dmx.has_sales)=0
                  , MIN(dmx.timestamp)
                  , MIN(IF(dmx.has_sales=1,dmx.timestamp,NULL))
                ) AS min_timestamp 
           FROM dup_email dmx
          GROUP BY dmx.email
       ) m
  JOIN dup_email dm1
    ON dm1.email      = m.email 
   AND dm1.timestamp = m.min_timestamp
 ORDER
    BY ...

NOTE

SQL syntax given above is specific to MySQL (the question is tagged as MySQL).

I think the IF() function is a MySQL-only extension.

For PostgreSQL, replace this:

              , IF( MAX(dmx.has_sales)=0
                  , MIN(dmx.timestamp)
                  , MIN(IF(dmx.has_sales=1,dmx.timestamp,NULL))
                ) AS min_timestamp

with a more portable, more ANSI standards compliant

              , CASE WHEN MAX(dmx.has_sales) = 0
                THEN MIN(dmx.timestamp)
                ELSE MIN( CASE WHEN dmx.has_sales = 1
                          THEN dmx.timestamp
                          END
                     )
                END AS min_timestamp

edited May 23, 2018 at 21:15

answered May 23, 2018 at 20:43

spencer7593

109k15 gold badges122 silver badges148 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Flavio de Falcao e Helena Over a year ago

Thank you for answering. In this case, I would have to make a join with dm1 and dm2, right?

spencer7593 Over a year ago

The first part of my answer (the original answer) was intended as a replacement for the correlated subquery in the parentheses in the original query. Answer is updated to add a demonstration of the original query with the replacement, as well as adding an alternative approach... getting the minimum timestamp for all email, as an inline view, and a join to get the rows that match that minimum timestamp.

spencer7593 Over a year ago

The SQL in this answer is specific to MySQL. I'm much less familiar with PostgreSQL, I expect the IF() function is a MySQL-specific extension.

Flavio de Falcao e Helena Over a year ago

Thank you very much. As I am using Postegresql, I would suggest to change the if clause for something like this: SELECT COALESCE(case when MAX(dm2.has_sales=0 then MIN(dm2.timestamp) else NULL END, case when dm2.has_sales=1 then dm2.time_stamp end) as min_timestamp

spencer7593 Over a year ago

Answer edited to include replacement for MySQL IF() functions with more portable, more ANSI standards compliant CASE expressions.

Collectives™ on Stack Overflow

Use a condition in a WHERE subquery in postgresql

1 Answer 1

5 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Your Answer

Sign up or log in

Post as a guest

Related