7

We have a table that saves information about the interval of the employees. Let's call it INTERVAL_TABLE.

We save when the user starts a interval and when he finishes. The user can start a interval as many times as he wants and finish as many times as he wants as well.

This is a simplified structure of the INTERVAL_TABLE:

   INTERVAL_ID | USER_ID | INTERVAL_TYPE_ID | INTERVAL_TIMESTAMP | ENTRY_TYPE

A user may have these entries in the table:

table possible entries

Now, we must create a report combining different entries of that table that refer to the same user and interval type. We should be able to identify intervals that have a start and an end and group these two in one row. Assuming the data in the image above, the output of the report should be the following:

report expected output

The output should be ordered by date, like the above image.

I have no idea how to create a query to do that.

Thanks!

Edit - Extra info:

To find the END interval for any INIT interval, we should find the closest END interval based on the timestamp of that interval. That's how we know we should match ID 1 with ID 2 and not with ID 3.

It's important to note that if a INIT interval is followed by another INIT interval (based on the timestamps), we should not proceed to find the END for that INIT. That is because this is a INIT without END.

6
  • 2
    Looking at the first "output" row. How does it determine that ID 2 matches to ID 1 and not 1 and 3 together? Is that based off of MIN time, or MIN ID? Or something else? This could help (LAG) stackoverflow.com/questions/17301816/lag-function-and-group-by Commented Mar 9, 2018 at 13:29
  • To find the END interval for any INIT interval, we should find the closest END interval based on the timestamp of that interval. That's how we know we should match ID 1 with ID 2 and not with ID 3. Commented Mar 9, 2018 at 13:32
  • It's important to note that if a INIT interval is followed by another INIT interval (based on the timestamps), we should not proceed to find the END for that INIT. That is because this is a INIT without END. Commented Mar 9, 2018 at 13:56
  • In fact, I have no idea how to create a query for that kind of stuff. I tried using the basics that I know about SQL and in the last night I ended up with one query with more than 200 lines and with one subquery for every column I need to show in the report (the table in the question is simplified). I'm pretty sure that there is a better (performance and readability) way of doing that. Commented Mar 11, 2018 at 17:48
  • Just read your edit, so if there are two INITs followed by an END, then the first INIT will not have an END, but the second will. Commented Mar 11, 2018 at 19:12

4 Answers 4

4
+50

It could be done easy and efficiently using LEAD and LAG functions. At least it is much more efficient than self-join of the table: O(n) vs O(n*n).

At first add columns for the next and previous row using LEAD and LAG with appropriate PARTITION BY.

Then build two sets of pairs - the first that starts with INIT_INTERVAL, the second that ends with END_INTERVAL. If there is a pair that has both Init and End - it will be included twice and later eliminated in UNION.

SQL Fiddle

Sample data (this is something that you should have included in your question in addition to the screenshot)

CREATE TABLE INTERVAL_TABLE (
  INTERVAL_ID int,
  USER_ID int,
  INTERVAL_TYPE_ID int,
  INTERVAL_TIMESTAMP timestamp,
  ENTRY_TYPE varchar(255));

INSERT INTO INTERVAL_TABLE (INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, INTERVAL_TIMESTAMP, ENTRY_TYPE) VALUES
(1, 1, 1, '2018-03-08 14:00:00', 'INIT_INTERVAL'),
(2, 1, 1, '2018-03-08 15:00:00', 'END_INTERVAL' ),
(3, 1, 1, '2018-03-08 15:30:00', 'END_INTERVAL' ),
(4, 1, 1, '2018-03-08 15:45:00', 'INIT_INTERVAL'),
(5, 1, 1, '2018-03-08 15:50:00', 'INIT_INTERVAL');

Query

WITH
CTE
AS
(
  SELECT
    USER_ID
    ,INTERVAL_TYPE_ID
    ,ENTRY_TYPE AS Curr_Entry_Type
    ,INTERVAL_TIMESTAMP AS Curr_Interval_Timestamp
    ,INTERVAL_ID AS Curr_Interval_ID

    ,LAG(ENTRY_TYPE) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Prev_Entry_Type
    ,LAG(INTERVAL_TIMESTAMP) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Prev_Interval_Timestamp
    ,LAG(INTERVAL_ID) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Prev_Interval_ID

    ,LEAD(ENTRY_TYPE) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Next_Entry_Type
    ,LEAD(INTERVAL_TIMESTAMP) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Next_Interval_Timestamp
    ,LEAD(INTERVAL_ID) OVER(PARTITION BY USER_ID, INTERVAL_TYPE_ID ORDER BY INTERVAL_TIMESTAMP) AS Next_Interval_ID
  FROM
    INTERVAL_TABLE
)
,CTE_Result
AS
(
  SELECT
    USER_ID
    ,INTERVAL_TYPE_ID
    ,Curr_Entry_Type AS Entry_Type_Init
    ,Curr_Interval_Timestamp AS Interval_Timestamp_Init
    ,Curr_Interval_ID AS Interval_ID_Init
    ,Next_Entry_Type AS Entry_Type_End
    ,CASE WHEN Next_Entry_Type = 'END_INTERVAL' THEN Next_Interval_Timestamp END AS Interval_Timestamp_End
    ,CASE WHEN Next_Entry_Type = 'END_INTERVAL' THEN Next_Interval_ID END AS Interval_ID_End
  FROM CTE
  WHERE Curr_Entry_Type = 'INIT_INTERVAL'

  UNION -- sic! not UNION ALL

  SELECT
    USER_ID
    ,INTERVAL_TYPE_ID
    ,Prev_Entry_Type AS Entry_Type_Init
    ,CASE WHEN Prev_Entry_Type = 'INIT_INTERVAL' THEN Prev_Interval_Timestamp END AS Interval_Timestamp_Init
    ,CASE WHEN Prev_Entry_Type = 'INIT_INTERVAL' THEN Prev_Interval_ID END AS Interval_ID_Init
    ,Curr_Entry_Type AS Entry_Type_End
    ,Curr_Interval_Timestamp AS Interval_Timestamp_End
    ,Curr_Interval_ID AS Interval_ID_End
  FROM CTE
  WHERE Curr_Entry_Type = 'END_INTERVAL'
)
SELECT
    USER_ID
    ,INTERVAL_TYPE_ID
    ,Interval_Timestamp_Init
    ,Interval_Timestamp_End
    ,Interval_ID_Init
    ,Interval_ID_End
FROM CTE_Result
ORDER BY
  USER_ID
  ,INTERVAL_TYPE_ID
  ,COALESCE(Interval_Timestamp_Init, Interval_Timestamp_End)

Results

| user_id | interval_type_id | interval_timestamp_init | interval_timestamp_end | interval_id_init | interval_id_end |
|---------|------------------|-------------------------|------------------------|------------------|-----------------|
|       1 |                1 |    2018-03-08T14:00:00Z |   2018-03-08T15:00:00Z |                1 |               2 |
|       1 |                1 |                  (null) |   2018-03-08T15:30:00Z |           (null) |               3 |
|       1 |                1 |    2018-03-08T15:45:00Z |                 (null) |                4 |          (null) |
|       1 |                1 |    2018-03-08T15:50:00Z |                 (null) |                5 |          (null) |
Sign up to request clarification or add additional context in comments.

7 Comments

@luiz I haven’t looked at this but a lead/lag should be efficient if it works. Though I’d be curious to see the actual performance difference
@vol7ron, it is easy to verify whether the query works by trying it on a set of sample data that covers all possible corner cases. The original question doesn't have a comprehensive test case, but I'm still confident that the query in my answer produces expected result. If you spot a case where it doesn't, please let us know. As for performance, it should be tested on real data and real hardware with proper indexes.
@VladimirBaranov Your answer, according to my tests, is the most efficient. Thanks!
@VladimirBaranov One problem that I'm currently facing is when I try to adapt to my real scenario. That is because in my real table some columns might have data for INITS but not for ENDS or vice versa. This is causing some rows to duplicate. Of course that this is not a problem with your solution but I would like to know, how would you deal with extra columns that might be null in INITS or ENDS?
@Luiz, please ask another question and include in it a set of sample data that shows all important cases from your real data. It must have more than 1 user_id, more than 1 interval_type_id, it must have a couple of extra columns with and without NULL values. Include this sample data as text, not as image; and as INSERT statements as I did in this answer. Show what result you expect. Show your current query and explain why its result is not exactly what you need. Format it all nicely. If you ask the question like this you will have very good chances to get good answers.
|
4

DBFiddle

This might not be the most efficient way to do this (I imagine a recursive query might be), but I find these subqueries easier to maintain:

WITH ordered_table AS (
  SELECT row_number() OVER(ORDER BY USER_ID,INTERVAL_TYPE_ID,INTERVAL_TIMESTAMP ASC) row_num, *
  FROM INTERVAL_TABLE
  ORDER BY row_num
),

_inits AS (
  SELECT
    t1.USER_ID,
    t1.INTERVAL_TYPE_ID      AS INTERVAL_TYPE,
    t1.INTERVAL_TIMESTAMP    AS INTERVAL_TIMESTAMP_INIT,
    CASE
      WHEN t1.ENTRY_TYPE = 'INIT_INTERVAL' 
       AND t2.ENTRY_TYPE = 'END_INTERVAL' 
       AND t1.USER_ID          = t2.USER_ID
       AND t1.INTERVAL_TYPE_ID = t2.INTERVAL_TYPE_ID      
      THEN t2.INTERVAL_TIMESTAMP 
    END                      AS INTERVAL_TIMESTAMP_END,
    t1.INTERVAL_ID           AS INTERVAL_ID_INIT,
    CASE
      WHEN t1.ENTRY_TYPE = 'INIT_INTERVAL' 
       AND t2.ENTRY_TYPE = 'END_INTERVAL' 
       AND t1.USER_ID          = t2.USER_ID
       AND t1.INTERVAL_TYPE_ID = t2.INTERVAL_TYPE_ID
      THEN t2.INTERVAL_ID 
    END                      AS INTERVAL_ID_END
  FROM      ordered_table AS t1
  LEFT JOIN ordered_table AS t2 ON (
    t1.row_num = t2.row_num - 1 AND
    t1.USER_ID = t2.USER_ID AND
    t1.INTERVAL_TYPE_ID = t2.INTERVAL_TYPE_ID
  )
  WHERE t1.ENTRY_TYPE = 'INIT_INTERVAL'
),

_ends AS (
  SELECT
    t2.USER_ID,
    t2.INTERVAL_TYPE_ID      AS INTERVAL_TYPE,
    NULL::timestamp          AS INTERVAL_TIMESTAMP_INIT,
    CASE
      WHEN (
          t1.ENTRY_TYPE = 'END_INTERVAL' AND
          t2.ENTRY_TYPE = 'END_INTERVAL'
        ) 
        OR (t1.ENTRY_TYPE IS NULL) -- case when first record for USER_ID and INTERVAL_TYPE_ID is an END
      THEN t2.INTERVAL_TIMESTAMP 
    END                      AS INTERVAL_TIMESTAMP_END,
    NULL::int                AS INTERVAL_ID_INIT,
    t2.INTERVAL_ID           AS INTERVAL_ID_END
  FROM       ordered_table AS t1
  RIGHT JOIN ordered_table AS t2 ON (
    t1.row_num = t2.row_num - 1 AND
    t1.USER_ID = t2.USER_ID AND
    t1.INTERVAL_TYPE_ID = t2.INTERVAL_TYPE_ID
  )
  WHERE t2.ENTRY_TYPE = 'END_INTERVAL'
)

SELECT * FROM (
  SELECT * FROM _inits
  UNION ALL
  SELECT * FROM _ends
) qry
WHERE 
  COALESCE(interval_timestamp_init, interval_timestamp_end) IS NOT NULL
ORDER BY 
  USER_ID, 
  INTERVAL_TYPE, 
  COALESCE(interval_timestamp_init, interval_timestamp_end)

Basically, INITs will always be listed. They will either have an associated END or a null. So almost all the content from _inits will be there.

Because the ENDs were already captured by the INITs, we only need to capture the ones that don't have an INIT (they were preceded by an END).

Because they are outer joins, you simply can remove the cases where INIT and END both are NULL and apply proper ordering.

5 Comments

Thanks for your answer! But this solution does not properly handle all cases. As I described here: Now, we must create a report combining different entries of that table that refer to the same user and interval type. I updated your fiddle to reflect such scenario. Your solution is mixing inits and ends of different interval types.
@luiz you’re absolutely correct. I wrote that caveat in the answer, but I must have closed the browser tab before the save registered. Good thing, though is it should be easy to update. When I get back to a non-mobile device I’ll take a look
@Luiz update — see wasn't too hard to make adjustments — I find SQL is really easy to set-and-forget and complicated SQL is really easy to comprehend years later. Keeping it simple is very helpful to future proof it. Realistically, though, databases can be a bottleneck and using the most efficient method may be your only option :)
I'll take a look in your adjustments. And yeah, you are right! In my performance tests your solution was really close to the solution proposed by Vladimir. But I believe that with ruge datasets and comparing lots of executions the solution proposed by Vladimir is more efficient - although I think it's more complicated to understand. :s
@Luiz I updated the sample data and realized a row was missing. When USER_ID/INTERVAL_TYPE_ID starts a new row with an "END", it wasn't being captured. I made adjustments to catch this edge case. That said, I still think a RECURSIVE or a LEAD/LAG (window functions) would be more efficient, but I'm not certain since this would require multiple window functions instead of a few joins.. I'd be curious to see the performance comparisons.
2

This query gives the output you need:

WITH Intervals AS
(
    WITH Events AS
    (
        WITH OrderedEvents AS
        (
            SELECT INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, INTERVAL_TIMESTAMP, ENTRY_TYPE, row_number() over (partition by USER_ID, INTERVAL_TYPE_ID order by INTERVAL_TIMESTAMP ASC) AS EVENT_ORDER FROM INTERVAL_TABLE
            UNION ALL
            SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'INIT_INTERVAL' AS ENTRY_TYPE, 0 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID
            UNION ALL
            SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'END_INTERVAL' AS ENTRY_TYPE, COUNT(*) + 1 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID
        )
        SELECT Events1.USER_ID, Events1.INTERVAL_TYPE_ID, Events1.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_INIT, Events2.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_END, Events1.INTERVAL_ID AS INTERVAL_ID_INIT, Events2.INTERVAL_ID  AS INTERVAL_ID_END, Events1.ENTRY_TYPE AS ENTRY_TYPE1, Events2.ENTRY_TYPE AS ENTRY_TYPE2
        FROM OrderedEvents Events1 INNER JOIN
        OrderedEvents Events2
        ON Events1.USER_ID = Events2.USER_ID AND Events1.INTERVAL_TYPE_ID = Events2.INTERVAL_TYPE_ID AND Events1.EVENT_ORDER + 1 = Events2.EVENT_ORDER
    )
    SELECT USER_ID, INTERVAL_TYPE_ID,

      CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_TIMESTAMP_INIT
           WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN INTERVAL_TIMESTAMP_INIT
           WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN NULL
      END AS INTERVAL_TIMESTAMP_INIT,

      CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_TIMESTAMP_END
           WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN NULL
           WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_TIMESTAMP_END
      END AS INTERVAL_TIMESTAMP_END,

      CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_ID_INIT
           WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN INTERVAL_ID_INIT
           WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN NULL
      END AS INTERVAL_ID_INIT,

      CASE WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_ID_END
           WHEN ENTRY_TYPE1 = 'INIT_INTERVAL' AND ENTRY_TYPE2 = 'INIT_INTERVAL' THEN NULL
           WHEN ENTRY_TYPE1 = 'END_INTERVAL' AND ENTRY_TYPE2 = 'END_INTERVAL' THEN INTERVAL_ID_END
      END AS INTERVAL_ID_END

    FROM Events
)
SELECT * FROM Intervals WHERE INTERVAL_ID_INIT IS NOT NULL OR INTERVAL_ID_END IS NOT NULL;

At first, we build OrderedEvents CTE that groups entries by USER_ID and INTERVAL_TYPE_ID, sorts them by INTERVAL_TIMESTAMP within each group and assign numeric order to each event. Also for each group we add INIT_INTERVAL as first event and END_INTERVAL as last event to cover cases when group starts with END_INTERVAL or finishes with INIT_INTERVAL:

WITH OrderedEvents AS
(
    SELECT INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, INTERVAL_TIMESTAMP, ENTRY_TYPE, row_number() over (partition by USER_ID, INTERVAL_TYPE_ID order by INTERVAL_TIMESTAMP ASC) AS EVENT_ORDER FROM INTERVAL_TABLE
    UNION ALL
    SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'INIT_INTERVAL' AS ENTRY_TYPE, 0 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID
    UNION ALL
    SELECT NULL AS INTERVAL_ID, USER_ID, INTERVAL_TYPE_ID, NULL AS INTERVAL_TIMESTAMP, 'END_INTERVAL' AS ENTRY_TYPE, COUNT(*) + 1 AS EVENT_ORDER FROM INTERVAL_TABLE GROUP BY USER_ID, INTERVAL_TYPE_ID
)
SELECT * FROM OrderedEvents ORDER BY user_id, interval_type_id, event_order;

This query gives following results for the provided data:

enter image description here

Then we intersect OrderedEvents with itself on USER_ID and INTERVAL_TYPE_ID and select pairs of neighbor events (Events1.EVENT_ORDER + 1 = Events2.EVENT_ORDER):

WITH OrderedEvents AS
(
    ...
)
SELECT Events1.USER_ID, Events1.INTERVAL_TYPE_ID, Events1.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_INIT, Events2.INTERVAL_TIMESTAMP AS INTERVAL_TIMESTAMP_END, Events1.INTERVAL_ID AS INTERVAL_ID_INIT, Events2.INTERVAL_ID  AS INTERVAL_ID_END, Events1.ENTRY_TYPE AS ENTRY_TYPE1, Events2.ENTRY_TYPE AS ENTRY_TYPE2
FROM OrderedEvents Events1 INNER JOIN
OrderedEvents Events2
ON Events1.USER_ID = Events2.USER_ID AND Events1.INTERVAL_TYPE_ID = Events2.INTERVAL_TYPE_ID AND Events1.EVENT_ORDER + 1 = Events2.EVENT_ORDER

This query gives following results:

enter image description here

Now we should transform these pairs of neighbor events to intervals based on the logic you described. Previous output has columns entry_type1 and entry_type2 which could take values of INIT_INTERVAL or END_INTERVAL. The possible combinations are:

  • <INIT_INTERVAL, END_INTERVAL> - this is the most natural case when INIT_INTERVAL is followed by END_INTERVAL. We take event values as is.
  • <INIT_INTERVAL(1), INIT_INTERVAL(2)> - the case of two consecutive INIT_INTERVAL. We force ending of the interval by taking <INIT_INTERVAL(1), NULL>. INIT_INTERVAL(2) will be taken with the next pair when it will be in the first entry.
  • <END_INTERVAL(1), END_INTERVAL(2)> - the case of two consecutive END_INTERVAL. We force start of the interval by taking <NULL, END_INTERVAL(2)>. END_INTERVAL(1) is processed either by case #1 or by the current case when it is the second entry in the pair.
  • <END_INTERVAL, INIT_INTERVAL> - such pairs are just skipped. END_INTERVAL is taken either by case #1 or case #3. INIT_INTERVAL is taken either by case #1 or case #2.

All this logic is put into set of CASE expressions. There are 4 such expressions with duplicated conditions, because we conditionally select 4 different columns (INTERVAL_TIMESTAMP_INIT, INTERVAL_TIMESTAMP_END, INTERVAL_ID_INIT and INTERVAL_ID_END) which could not be done with one CASE expression.

The final output is the same as you described:

enter image description here

1 Comment

The required kind of logic can be done with LEAD and LAG without the self-join. Usually it is much more efficient. Have a look at my answer.
1

You can use the INTERVAL_ID (or a new column with a generated row_number) to join two instances of the same table, using as predicate something like this:

on a.INTERVAL_ID=b.INTERVAL_ID + 1

This way, you can compare and get in 1 line each record with the next one.

2 Comments

There is nothing that guarantee that the next INTERVAL_ID will be the closest next timestamp. We should consider the timestamp.
Yes a better way would be a ROW_NUMBER ordered by that Timestamp, and perform the join using this ROW_NUMBER instead of the INTERVAL_ID. The first approach was just a way to start to solve the issue...

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.