0

I would like some help on this situation. I have a table with UUID (unique), email (repeated), timestamp (unique) and has_sales (can be 1 if yes and 0 if no)

Sample data

uuid       email       timestamp        has_sales
    1   [email protected] 2016-10-02 10:28:23    0
    2   [email protected] 2017-10-03 10:28:23    0
    3   [email protected] 2017-10-06 17:08:15    1
    4   [email protected] 2017-12-04 20:47:17    0
    5   [email protected] 2018-05-21 15:27:04    0
    6   [email protected] 2016-10-02 10:28:23    1
    7   [email protected] 2017-10-03 10:28:23    0

I would like to choose the oldest timestamp, except when there is a sale on a newer one (it is rare, but it may occur). So, the expected result would be

  uuid     email       timestamp        has_sales
    3   [email protected] 2017-10-06 17:08:15    1
    6   [email protected] 2016-10-02 10:28:23    1

Currently, I am just using the first condition (oldest timestamp), as follows:

SELECT
    dm1.uuid,
    dm1.email,
    dm1.timestamp,
    dm1.has_sales
FROM dup_mail dm1
where
    time_stamp = (select min(time_stamp)
                       from dup_mail dm2
                       where dm1.email = dm2.email
                       )
order by 2

How, upgrading this code, could I add the condition of if there is sale to a newer user and no sale to the older one, I would choose the newer one? Each email is related to either no sales (0 in all duplicate accounts) or yes sale (1 in one of the duplicate accounts and 0 in the other ones). Even if there is more than one duplicate account with sales, I just want to know if there was a sale or not

2
  • is your sale on the basis of email? Commented May 23, 2018 at 20:38
  • Yes, each email has either 0 sales or 1 sale in one of the duplicate accounts. I don't believe there are cases with sales in two or more duplicate accounts based on the same email, but if it does, I just care if the email had a sale or not Commented May 23, 2018 at 20:42

1 Answer 1

2

The correlated subquery could be rewritten

  SELECT dm2.timestamp
    FROM dup_mail dm2
   WHERE dm2.email = dm1.email 
   ORDER
      BY dm2.has_sales DESC
       , dm2.timestamp ASC 
   LIMIT 1

This is going to sort the rows with has_sales=1 before rows with has_sales=0, and then by timestamp. The LIMIT 1 clause picks the first row (after the set is sorted.)

We are going to want a suitable index on dup_mail table with email as the leading column. Including timestamp and has_sales columns in the index would make it a covering index for the subquery.

This should satisfy the specification, but the correlated subquery may not be optimal in terms of performance.

SELECT dm1.uuid
     , dm1.email
     , dm1.timestamp
     , dm1.has_sales
  FROM dup_mail dm1 
 WHERE dm1.timestamp = 
       ( SELECT dm2.timestamp
           FROM dup_mail dm2
          WHERE dm2.email = dm1.email 
          ORDER
             BY dm2.has_sales DESC
              , dm2.timestamp ASC 
          LIMIT 1
      )
ORDER 
   BY ...

(It's a bit odd that timestamp would be unique across all of the rows; but if it is, then this query could work.)


We might get better performance with something like this:

SELECT dmx.email 
     , IF( MAX(dmx.has_sales)=0
         , MIN(dmx.timestamp)
         , MIN(IF(dmx.has_sales=1,dmx.timestamp,NULL))
       ) AS min_timestamp 
  FROM dup_email dmx
 GROUP BY dmx.email

And then using that as an inline view and join to the dup_mail table to get the rows associated with the minimum timestamp

SELECT dm1.uuid
     , dm1.email
     , dm1.timestamp
     , dm1.has_sales
  FROM ( -- minimum timestamp for each email
         SELECT dmx.email
              , IF( MAX(dmx.has_sales)=0
                  , MIN(dmx.timestamp)
                  , MIN(IF(dmx.has_sales=1,dmx.timestamp,NULL))
                ) AS min_timestamp 
           FROM dup_email dmx
          GROUP BY dmx.email
       ) m
  JOIN dup_email dm1
    ON dm1.email      = m.email 
   AND dm1.timestamp = m.min_timestamp
 ORDER
    BY ...

NOTE

SQL syntax given above is specific to MySQL (the question is tagged as MySQL).

I think the IF() function is a MySQL-only extension.

For PostgreSQL, replace this:

              , IF( MAX(dmx.has_sales)=0
                  , MIN(dmx.timestamp)
                  , MIN(IF(dmx.has_sales=1,dmx.timestamp,NULL))
                ) AS min_timestamp 

with a more portable, more ANSI standards compliant

              , CASE WHEN MAX(dmx.has_sales) = 0
                THEN MIN(dmx.timestamp)
                ELSE MIN( CASE WHEN dmx.has_sales = 1
                          THEN dmx.timestamp
                          END
                     )
                END AS min_timestamp
Sign up to request clarification or add additional context in comments.

5 Comments

Thank you for answering. In this case, I would have to make a join with dm1 and dm2, right?
The first part of my answer (the original answer) was intended as a replacement for the correlated subquery in the parentheses in the original query. Answer is updated to add a demonstration of the original query with the replacement, as well as adding an alternative approach... getting the minimum timestamp for all email, as an inline view, and a join to get the rows that match that minimum timestamp.
The SQL in this answer is specific to MySQL. I'm much less familiar with PostgreSQL, I expect the IF() function is a MySQL-specific extension.
Thank you very much. As I am using Postegresql, I would suggest to change the if clause for something like this: SELECT COALESCE(case when MAX(dm2.has_sales=0 then MIN(dm2.timestamp) else NULL END, case when dm2.has_sales=1 then dm2.time_stamp end) as min_timestamp
Answer edited to include replacement for MySQL IF() functions with more portable, more ANSI standards compliant CASE expressions.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.