Using SQL Server 2012. Every night a data warehouse load populates a table of milestone dates that a loan goes through. The data looks like this:

CREATE TABLE TestData (LoanKey int, MilestoneCompletedDate datetime, Duration int)

INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (2, '2013-10-16 16:51:56.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (2, '2013-10-18 15:11:29.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (2, '2013-10-23 16:21:59.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (2, '2013-10-28 14:52:00.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (2, '2013-08-26 10:53:37.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (2, '2013-09-19 15:16:38.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (2, '2013-09-20 08:31:38.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (2, '2013-10-08 15:56:05.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (2, '2013-10-16 16:11:10.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (2, '2013-10-09 11:20:35.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (2, '2013-09-10 11:15:09.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (42, '2013-06-03 16:22:32.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (42, '2013-06-21 14:46:24.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (42, '2013-08-30 10:03:08.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (42, '2013-08-30 13:55:17.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (42, '2013-09-03 15:28:22.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (42, '2013-09-04 09:30:08.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (42, '2013-09-12 10:44:46.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (42, '2013-09-25 16:06:43.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (42, '2013-06-24 11:59:25.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (42, '2013-09-25 16:06:43.000')
INSERT TestData (LoanKey, MilestoneCompletedDate) VALUES (42, '2013-01-17 15:06:14.000')

After the data loads I want to update the "Duration" field. Here is some pseudo-code:

UPDATE TestData SET Duration = 'Find the DateDiff between the current row''s MilestoneCompletedDate and the next greatest milestone completion date for the same loan'

I can generate a row number with PARTITION BY and ORDER BY:

SELECT 
    LoanKey,
    MilestoneCompletedDate, 
    ROW_NUMBER() OVER (PARTITION BY LoanKey ORDER BY MilestoneCompletedDate DESC) AS SequenceNumber
FROM 
    [dbo].[TestData] 

Any ideas on where to go from here to populate Duration?

Thank you for looking!

2 Answers

Since you're on SQL Server 2012, you can use LEAD:

From the documentation: "Accesses data from a subsequent row in the same result set without the use of a self-join in SQL Server 2012. LEAD provides access to a row at a given physical offset that follows the current row."

-- NextCompletion is the following milestone for the same loan (NULL for the last one)
;With leads as (
    select *, LEAD(MilestoneCompletedDate) OVER
                 (PARTITION BY LoanKey
                  ORDER BY MilestoneCompletedDate) as NextCompletion
    from TestData
)
UPDATE leads SET Duration = DATEDIFF(second, MilestoneCompletedDate, NextCompletion)

select * from TestData

Produces:

LoanKey     MilestoneCompletedDate  Duration
----------- ----------------------- -----------
2           2013-10-16 16:51:56.000 166773
2           2013-10-18 15:11:29.000 436230
2           2013-10-23 16:21:59.000 426601
2           2013-10-28 14:52:00.000 NULL
2           2013-08-26 10:53:37.000 1297292
2           2013-09-19 15:16:38.000 62100
2           2013-09-20 08:31:38.000 1581867
2           2013-10-08 15:56:05.000 69870
2           2013-10-16 16:11:10.000 2446
2           2013-10-09 11:20:35.000 622235
2           2013-09-10 11:15:09.000 792089
42          2013-06-03 16:22:32.000 1549432
42          2013-06-21 14:46:24.000 249181
42          2013-08-30 10:03:08.000 13929
42          2013-08-30 13:55:17.000 351185
42          2013-09-03 15:28:22.000 64906
42          2013-09-04 09:30:08.000 695678
42          2013-09-12 10:44:46.000 1142517
42          2013-09-25 16:06:43.000 0
42          2013-06-24 11:59:25.000 5781823
42          2013-09-25 16:06:43.000 NULL
42          2013-01-17 15:06:14.000 11841378

On previous versions of SQL Server, I'd have taken your ROW_NUMBER() based query and done something like the self-join that the LEAD documentation refers to:

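;With Ordered as (
    SELECT 
        LoanKey,
        MilestoneCompletedDate, 
        Duration,  -- must be exposed by the CTE so it can be updated
        -- Ascending order, so SequenceNumber + 1 is the next (later) milestone
        ROW_NUMBER() OVER (PARTITION BY LoanKey
                ORDER BY MilestoneCompletedDate ASC) AS SequenceNumber
    FROM 
        [dbo].[TestData]
)
-- o2 is the row that follows o1; the last milestone has no match and gets NULL
UPDATE o1 SET Duration =
    DATEDIFF(second,o1.MilestoneCompletedDate,o2.MilestoneCompletedDate)
FROM Ordered o1
LEFT JOIN Ordered o2
ON o1.LoanKey = o2.LoanKey and o1.SequenceNumber = o2.SequenceNumber - 1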

4 Comments

Gah, nice answer. I'm completely oblivious to features post-2008; LEAD is a great feature.
Thank you for this response! There is an issue with the LEAD approach. The row defined by LoanKey = 2 and MilestoneCompletedDate = 2013-10-28 has a duration of NULL. It is the last milestone date for the loan. The previous milestone was 2013-10-23, so the duration should be 5. It looks like the correct durations are all shifted down one milestone. The second approach works with the addition of Duration to the Ordered table and reversing o2 and o1 in DATEDIFF.
@jjm - that's because you talked about taking a row and then finding the next greatest row, so I assumed you wanted to update the current row based on the next date. But from your comment, what you wanted (for any particular row) was to find the previous row. In that case, either use LAG instead of LEAD or (logically the same) change the ORDER BY in the LEAD query to DESC instead of the default ASC.
I see ... very cool. I did not know about LAG and LEAD! This gets what I was looking for! Thank you!
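
The LAG variant discussed in these comments could look roughly like the sketch below. It assumes the TestData table from the question and keeps `second` as the unit to match the LEAD example; the column alias PreviousCompletion is made up for illustration. Each row gets the time elapsed since the previous milestone of the same loan, and the first milestone of each loan is left NULL.

```sql
-- Sketch only: Duration = time since the previous milestone of the same loan.
-- PreviousCompletion is a hypothetical alias; the unit (second) is an assumption.
;WITH lags AS (
    SELECT Duration,
           MilestoneCompletedDate,
           LAG(MilestoneCompletedDate) OVER
               (PARTITION BY LoanKey
                ORDER BY MilestoneCompletedDate) AS PreviousCompletion
    FROM TestData
)
UPDATE lags
SET Duration = DATEDIFF(second, PreviousCompletion, MilestoneCompletedDate);
```

Switching `second` to `day` should give the 5 expected for the 2013-10-28 row of LoanKey 2, since DATEDIFF counts the day boundaries crossed between 2013-10-23 and 2013-10-28.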

This should do the job, change HOUR to whatever unit you like:

UPDATE a
SET Duration = DATEDIFF( HOUR 
                        ,(SELECT MAX(b.MilestoneCompletedDate) 
                          FROM TestData b
                          WHERE b.MilestoneCompletedDate < a.MilestoneCompletedDate
                            AND b.LoanKey = a.LoanKey) 
                        ,a.MilestoneCompletedDate )
FROM TestData a
WHERE Duration IS NULL

SQLFiddle seems to be broken for SQL Server at the moment, so I can't post a fiddle, but the top 4 rows (ordered by LoanKey, MilestoneCompletedDate) after the update come out as:

LoanKey  MilestoneCompletedDate   Duration
2        2013-08-26 10:53:37.000  NULL
2        2013-09-10 11:15:09.000  361
2        2013-09-19 15:16:38.000  220
2        2013-09-20 08:31:38.000  17
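
When picking a unit, it may help to remember that DATEDIFF counts the unit boundaries crossed between the two values rather than whole elapsed units. A small sketch using two timestamps from the sample data for LoanKey 2 (the variable names and column aliases are made up for illustration):

```sql
-- Sketch only: compare boundary counting with elapsed time.
DECLARE @earlier datetime = '2013-09-19 15:16:38';
DECLARE @later   datetime = '2013-09-20 08:31:38';

SELECT DATEDIFF(HOUR, @earlier, @later)          AS HourBoundariesCrossed, -- 17, as in the rows above
       DATEDIFF(MINUTE, @earlier, @later) / 60.0 AS ElapsedHours;          -- 17.25
```

If fractional durations matter, one option is to compute the difference in a finer unit and divide, as the second column does.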

Alternatively, you could use ROW_NUMBER like you were thinking, but it's a little messy. Something like:

;WITH cte AS (
    SELECT 
        LoanKey,
        MilestoneCompletedDate, 
        Duration,
        ROW_NUMBER() OVER (PARTITION BY LoanKey 
                           ORDER BY MilestoneCompletedDate ASC) AS SeqNum
    FROM TestData
    WHERE duration IS NULL
)
UPDATE a
SET Duration = DATEDIFF(HOUR
                       ,b.MilestoneCompletedDate
                       ,a.MilestoneCompletedDate)
FROM cte a
INNER JOIN cte b ON a.LoanKey = b.LoanKey 
                AND a.SeqNum = b.SeqNum + 1

which returns the same result set.
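
One caveat: the two approaches can differ when a loan has two milestones with exactly the same timestamp. The correlated subquery uses a strict `<`, so both tied rows are measured from the earlier distinct milestone, whereas ROW_NUMBER puts the tied rows in sequence and one of them gets a duration of 0. The sample data has such a pair for LoanKey 42 (2013-09-25 16:06:43.000). A quick check for ties, assuming the TestData table from the question (the Copies alias is made up for illustration):

```sql
-- Sketch only: list timestamps that occur more than once for the same loan.
SELECT LoanKey, MilestoneCompletedDate, COUNT(*) AS Copies
FROM TestData
GROUP BY LoanKey, MilestoneCompletedDate
HAVING COUNT(*) > 1;
```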

4 Comments

Whether the row numbers are actually computed twice or not is at the mercy of the optimizer, and isn't something I'd even look in the execution plan to discover unless there was a performance issue.
Fair play, removed. When chaining CTEs in the past I've seen significant performance gains by SELECTing from each as few times as possible but I realise there is no general rule.
Your first approach works for me and I like its simplicity. Thank you very much!
No problem, although it looks like Damien has a simple solution to the problem you were having with LEAD
