SQL remove duplicate rows based on other rows

Question

I have two tables (one that shows a record of all the programs sent to a customer, and another that shows a record of all the programs the user took action on). My ultimate goal is to count how many of those programs that were sent to the user were acted upon. I have a way of doing that right now but there is a problem. The programs may be sent to the user multiple times (I will use 20 for an example) within a single day. If the user takes action on that program within 24 hours, I don't want it to count as 19 programs that weren't acted on and 1 program that was acted on. I want it to count as 1 acted upon program with 0 fails.

The table that contains the programs sent to the customer needs to be filtered down so that each program that is sent to the customer only shows once per 24 hour period.

My solution is to have a "window" type thing so that once a program is sent to a customer, it is 'locked' for that customer for 24 hours so it won't appear in my 'programs sent to customer' query multiple times.

I have a table like this:

Customer    Time    Program
-----------------------------------
1           8:05    a
1           10:30   a
1           11:30   a
1           12:30   b
1           1:25    a
2           9:38    b
2           10:38   c
2           1:36    c
2           2:40    c
2           3:41    b
.
.
.

I want to get a table (query, not delete) that removes duplicate programs per customer within a certain time frame (confusing! I know)

Here is what I want (with a 3hr time frame for example):

Customer    Time    Program
-----------------------------------
1           8:05    a
1           11:30   a
1           12:30   b
2           9:38    b
2           10:38   c
2           2:40    c
2           3:41    b
.
.
.

what kind of column is "Time" datetime, timestamp, etc??? also what rdbms and version are you using? — Saic Siquot
– Saic Siquot, Commented Feb 21, 2013 at 15:33
And what is the datatype of Time? It's going to be really hard - for any code - to identify that 1:36 is after 10:38. — ypercubeᵀᴹ
– ypercubeᵀᴹ, Commented Feb 21, 2013 at 15:34
If no answer solved your problem, clarify here, don't post a new identical question. Sorry if this is conflicting with other advice you got but new question means you have a new problem, at least a bit different than an existing solved one. — ypercubeᵀᴹ
– ypercubeᵀᴹ, Commented Feb 21, 2013 at 21:56

Gordon Linoff · Accepted Answer · 2013-02-21 15:48:01Z

2

Try this:

select *
from t
where not exists (select 1 from t t2
                  where t2.customer = t.customer and
                        t2.program = t.program and
                        t2.time - t.time < 3.0/24 and
                        t2.time > t.time
                 )

Date time arithmetic depends on databases, but this works for many of them.

edited Feb 21, 2013 at 15:48

answered Feb 21, 2013 at 15:34

Gordon Linoff

1.3m62 gold badges706 silver badges857 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

ypercubeᵀᴹ Over a year ago

This is on the right direction but it will not show the wanted results.

Gordon Linoff Over a year ago

@ypercube . . . I assume you are referring to the "same program" part of the question. I just added that in.

ypercubeᵀᴹ Over a year ago

No. Besides that, it's not what the OP wants. The "rolling window" requirement can be solved only with a recursive query, I think.

Gordon Linoff Over a year ago

@ypercube . . . I do not read the question that way. And, the second line of "a" in the desired output supports my reading of the question.

John Verrone Over a year ago

I am going to mark this as the correct answer because it is really close to what I want. The 'moving window' that is only possible with the recursive function is just not going to happen because I ran the recursive one for about an hour and got impatient so I gave up on that. Thank you @ypercube gordon

|

ypercubeᵀᴹ · Accepted Answer · 2013-02-21 21:50:29Z

I think that what you mean can be solved (only) with a recursive query. Here is a solution:

WITH cte AS
  ( SELECT 
        customer, program, time,
        ROW_NUMBER() 
            OVER (PARTITION BY customer, program
                  ORDER BY time)
          AS rn, 
        MIN(time) 
            OVER (PARTITION BY customer, program
                  ORDER BY time 
                  RANGE BETWEEN 3.0/24 FOLLOWING 
                            AND UNBOUNDED FOLLOWING)
          AS next_time
    FROM a
  )

SELECT 
    customer, time, program
FROM 
    cte
START WITH rn = 1
CONNECT BY PRIOR customer = customer
       AND PRIOR program = program
       AND PRIOR next_time = time
ORDER BY 
    customer, time, program ;

You can also replace MIN(time) with FIRST_VALUE(time) above and get the same results. It might be more efficient.

Tested at SQL-Fiddle

The efficiency on a big table is probably not going to be very good. You can try running the query with smaller set of data.

And you should at least add this index, so it does an index scan:

CREATE INDEX ix                -- choose a name for the index
  ON tableX                    -- the table name
  (customer, program, time) ;

You could also skip the final ordering or change it so it is more similar to the used index:

ORDER BY 
    customer, program, time ;

Collectives™ on Stack Overflow

SQL remove duplicate rows based on other rows

2 Answers 2

8 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

8 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related