Split rows with array into multiple rows with sub-arrays

Question

Let say I have table like this:

+-----+-----------------+-------------+
| ID  |     Points      | BreakPoints |
+-----+-----------------+-------------+
| 123 | {6,8,1,3,7,9}   | {1,7}       |
| 456 | {16,9,78,96,33} | {78}        |
+-----+-----------------+-------------+

I want to "break" these Points sequences on points contained in BreakPoints, while keeping ID of original row. Order of elements in sequence is important, so I cannot sort them!

Also notice, that break points are in both result rows that came from breaking original sequence at that break point (on the end and start respectively). So result should be something like this:


+-----+------------+
| ID  |   Points   |
+-----+------------+
| 123 | {6,8,1}    |
| 123 | {1,3,7}    |
| 123 | {7,9}      |
| 456 | {16,9,78}  |
| 456 | {78,96,33} |
+-----+------------+

Of course, I can write PL/pgSQL function, call it for every row, iterate the array and RETURN NEXT for every sub-sequence. But is there any other way, without calling function for all rows?

@GordonLinoff because it is about position in the array, not element "value". (elements are in fact IDs themselves, their "value" is irrelevant). — rouen
– rouen, Commented Jul 16, 2020 at 13:00

Marth · Accepted Answer · 2020-07-17 16:52:35Z

2

WITH data(id, points, breakpoints) AS (
    VALUES (123, ARRAY [6,8,1,3,7,9], ARRAY [7, 1])
         , (456, ARRAY [16,9,78,96,33], ARRAY [78])
),
-- we'll map the breakpoints to the indices where they appear in `points` and sort this array
-- so, ARRAY[1, 7] -> ARRAY[3, 5] (the positions of 1 & 7 in `points`, arrays are 1-based)
-- and ARRAY[7, 1] -> ARRAY[3, 5] (since we sort this new 'breakpoint_indices' array)
sorted_breakpoint_indices(id, points, breakpoint_indices, number_of_breakpoints) AS (
    SELECT id
         , points
         , breakpoint_indices
         , number_of_breakpoints
    FROM data
    JOIN LATERAL (
        SELECT ARRAY_AGG(array_position(points, breakpoint) ORDER BY array_position(points, breakpoint))
             , COUNT(*) -- simply here to avoid multiple `cardinality(breakpoint_indices)` below
        FROM unnest(breakpoints) AS breakpoint
    ) AS f(breakpoint_indices, number_of_breakpoints)
    ON true
)
SELECT id
     , CASE i
         -- first segment, from start to breakpoint #1
         WHEN 0 THEN points[:breakpoint_indices[1]]
         -- last segment, from last breakpoint to end
         WHEN number_of_breakpoints THEN points[breakpoint_indices[number_of_breakpoints]:]
         -- default case, bp i to i+1
         ELSE points[breakpoint_indices[i]:breakpoint_indices[i+1]]
       END
FROM sorted_breakpoint_indices
   , generate_series(0, number_of_breakpoints, 1) AS f(i)

returns

+---+----------+
|id |result    |
+---+----------+
|123|{6,8,1}   |
|123|{1,3,7}   |
|123|{7,9}     |
|456|{16,9,78} |
|456|{78,96,33}|
+---+----------+

Note: I wrote multiple other versions of this while writing this answer, these can be seen by looking at this post's edit history

edited Jul 17, 2020 at 16:52

answered Jul 16, 2020 at 13:16

Marth

24.9k3 gold badges68 silver badges76 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

rouen Over a year ago

Very interesting approach :) One more thing that is probably not clear for my sample (bummer): I have no assurance, that breakpoints are in the same order that they appear in the sequence. So, if you change {1,7} in the first row to {7,1}, it should still return my proposed result. I will try to adapt your solution to this.

Marth Over a year ago

@rouen: Ok, I might have a way to fix that. One question though, can breakpoints be duplicated? I.e breakpoints = Array[1, 1, 7] with 1 appearing twice in points (and being 'consumed' the first time you encounter it).

rouen Over a year ago

Nope, no duplicates in breakpoints. Also no duplicates possible in original sequences, if that matters :)

rouen Over a year ago

You are SQL beast, man! Thank you. Now for the interesting part: benchmarking this on millions of rows :)

rouen Over a year ago

Hey, just to let you know: it works really well on real dataset. It splits ~700k sequences in about 5 seconds on pretty average server :)

|

Gordon Linoff · Accepted Answer · 2020-07-16 14:43:47Z

1

I think this does what you want:

select t.id,
       array_agg(point order by point)
from t cross join
     unnest(points) point cross join lateral
     (select lag(breakpoint) over (order by breakpoint) as prev_breakpoint, breakpoint
      from unnest(t.breakpoints) breakpoint
      union all
      select max(breakpoint), null
      from unnest(t.breakpoints) breakpoint
     ) b
where (point >= prev_breakpoint or prev_breakpoint is null) and
      (point <= breakpoint or breakpoint is null)
group by t.id, breakpoint;

Here is a db<>fiddle.

EDIT:

Here is the revised code for solving your actual problem:

select id, grp,
       (case when lead(max(breakpoint)) over (partition by id order by grp) is not null
             then array_agg(point order by n) || lead(max(breakpoint)) over (partition by id order by grp)
             else array_agg(point order by n)
        end) as next_breakpoint
from (select t.id, p.*, breakpoint,
             count(breakpoint) over (partition by t.id order by p.n) as grp
      from t cross join
           unnest(points) with ordinality as p(point, n) left join
           unnest(breakpoints) as breakpoint
           on p.point = breakpoint
     ) t
group by id, grp;

This has been included in the db<>fidde.

The idea is pretty simply. Just return the position of each point and match to the breakpoints. Then a window function is used to define the groups.

The only complication is on the aggregation. You want the break point in both records. So that requires some manipulation. I think the array manipulation with lead() is simpler than alternatives.

edited Jul 16, 2020 at 14:43

answered Jul 16, 2020 at 12:27

Gordon Linoff

1.3m62 gold badges705 silver badges857 bronze badges

2 Comments

rouen Over a year ago

It does not. Maybe I should clarify that order of elements is important, I can not "sort" them. I will add that into question.

Gordon Linoff Over a year ago

@rouen . . . Well, I like my first answer, even if it is to the wrong question. You can always ask another question. More seriously, I updated the answer with a similar approach for your actual problem.

Collectives™ on Stack Overflow

Split rows with array into multiple rows with sub-arrays

2 Answers 2

6 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

6 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related