I'm trying to define a PostgreSQL aggregate function that is aware of rows that the frame clause asks for but that do not exist. Specifically, consider an aggregate function framer whose job is to return an array of the values aggregated through it, with any missing values in the frame returned as null. So,

select
    n,
    v,
    framer(v) over (order by v rows between 2 preceding and 2 following) arr
from (values (1, 3200), (2, 2400), (3, 1600), (4, 2900), (5, 8200)) as v (n, v)
order by v

should return

"n" "v" "arr"
3   1600    {null,null,1600,2400,2900}
2   2400    {null,1600,2400,2900,3200}
4   2900    {1600,2400,2900,3200,8200}
1   3200    {2400,2900,3200,8200,null}
5   8200    {2900,3200,8200,null,null}

Basically I want to grab a range of values around each value, and it's important to me to know if I'm missing any to the left or to the right (or potentially both). Seems simple enough. I expected something like this to work:

create aggregate framer(anyelement) (
    sfunc = array_append,
    stype = anyarray,
    initcond = '{}'
);

but it returns

"n" "v" "arr"
3   1600    {1600,2400,2900}
2   2400    {1600,2400,2900,3200}
4   2900    {1600,2400,2900,3200,8200}
1   3200    {2400,2900,3200,8200}
5   8200    {2900,3200,8200}

So sfunc is really only being called three times when two of the values are missing.
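One way to confirm this is to count the rows that actually fall inside each frame. The query below is only a sketch using the built-in count(*) with the same frame clause; the table alias t is an arbitrary name chosen for the illustration. At the edges of the partition the frame is simply clipped, so there are no rows there for an sfunc to be called on.

```sql
-- Count the rows that exist in each frame. Rows near the start or end
-- of the partition get a shorter frame, not a frame padded with nulls.
select
    n,
    v,
    count(*) over (order by v rows between 2 preceding and 2 following) as rows_in_frame
from (values (1, 3200), (2, 2400), (3, 1600), (4, 2900), (5, 8200)) as t (n, v)
order by v;
-- rows_in_frame, in order of v: 3, 4, 5, 4, 3
```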

I haven't been able to think of any non-ridiculous way to capture those missing rows. It seems like there should be a simple solution, like somehow prepending/appending some sentinel nulls to the data before the aggregate runs, or maybe somehow passing in the index (and frame values) as well as the actual value to the function...

I wanted to implement this as an aggregate because it gave the nicest user-facing experience for what I want to do. Is there any better way?

FWIW, I'm on PostgreSQL 9.6.

1 Answer


Ok, this was an interesting one. :)

I created an aggregate framer(anyelement, int) whose second argument sets the size of the result array, so it can be matched to the size of the window frame. The state it carries between calls is an anyarray.

First we replace array_append with our own framer_msfunc:

CREATE OR REPLACE FUNCTION public.framer_msfunc(arr anyarray, val anyelement, size_ integer)
 RETURNS anyarray
 LANGUAGE plpgsql
AS $function$
DECLARE
    result ALIAS FOR $0;
    null_ val%TYPE := NULL; -- NULL of the same type as `val`
BEGIN

    IF COALESCE(array_length(arr, 1), 0) = 0 THEN
        -- create an array of nulls with the size of `size_`
        result := array_fill(null_, ARRAY[size_]);
    ELSE
        result := arr;
    END IF;

    IF result[size_] IS NULL THEN
        -- first run or after `minvfunc`.
        -- a NULL inserted at the end in `minvfunc` so we want to replace that.
        result[size_] := val;
    ELSE
        -- `minvfunc` not yet called so we just append and drop the first.
        result := (array_append(result, val))[2:];
    END IF;

    RETURN result;

END;
$function$;

Then we create a minvfunc as it is needed for moving aggregates.

CREATE OR REPLACE FUNCTION public.framer_minvfunc(arr anyarray, val anyelement, size_ integer)
 RETURNS anyarray
 LANGUAGE plpgsql
AS $function$
BEGIN

    -- drop the first in the array and append a null
    RETURN array_append(arr[2:], NULL);

END;
$function$;

Then we define the aggregate with the moving aggregate arguments:

create aggregate framer(anyelement, int) (
    sfunc = framer_msfunc,
    stype = anyarray,
    msfunc = framer_msfunc,
    mstype = anyarray,
    minvfunc = framer_minvfunc,
    minitcond = '{}'
);

We use framer_msfunc as sfunc too, since sfunc is required, but it doesn't produce a meaningful result when the aggregate runs outside moving-aggregate mode. It could be replaced with a function that takes the same arguments but simply calls array_append, so that it does something useful.
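A minimal sketch of such a replacement is shown below. The name framer_sfunc is hypothetical and not part of the code above; the function keeps the same three-argument signature so it can be named in sfunc =, but it ignores size_ and only appends.

```sql
-- Hypothetical plain-aggregation transition function (an assumption, not
-- part of the original answer). It has the same signature as framer_msfunc
-- but ignores size_ and just appends val to the state array.
CREATE OR REPLACE FUNCTION public.framer_sfunc(arr anyarray, val anyelement, size_ integer)
 RETURNS anyarray
 LANGUAGE sql
AS $function$
    SELECT array_append(arr, val);
$function$;
```

With sfunc = framer_sfunc in the aggregate definition, a call without an OVER clause, or with a frame that cannot use the moving-aggregate path, would collect the input values into an array, much like the original array_append-based attempt.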

And here's your example but with a couple more input values.

The size argument should be at least the number of rows in the window frame. It doesn't really work with smaller sizes.

select
    n,
    v,
    framer(v, 5) over (order by v rows between 2 preceding and 2 following) arr
from (values (1, 3200), (2, 2400), (3, 1600), (4, 2900), (5, 8200), (6, 2333), (7, 1500)) as v (n, v)
order by v
;
 n |  v   |            arr
---+------+----------------------------
 7 | 1500 | {NULL,NULL,1500,1600,2333}
 3 | 1600 | {NULL,1500,1600,2333,2400}
 6 | 2333 | {1500,1600,2333,2400,2900}
 2 | 2400 | {1600,2333,2400,2900,3200}
 4 | 2900 | {2333,2400,2900,3200,8200}
 1 | 3200 | {2400,2900,3200,8200,NULL}
 5 | 8200 | {2900,3200,8200,NULL,NULL}
(7 rows)

It would be nice if the size could be inferred from the window frame itself, but I couldn't find a way to do that.
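For comparison, when the frame offsets are known in advance, the same arrays can be built without a custom aggregate by using the built-in lag and lead window functions, which return null when the requested row does not exist. This is only a sketch of an alternative, not the aggregate asked for in the question; the table alias t and window name w are arbitrary names chosen here.

```sql
-- Build the 5-element array around each row with lag/lead.
-- Offsets that fall outside the partition come back as NULL.
select
    n,
    v,
    array[
        lag(v, 2)  over w,
        lag(v, 1)  over w,
        v,
        lead(v, 1) over w,
        lead(v, 2) over w
    ] as arr
from (values (1, 3200), (2, 2400), (3, 1600), (4, 2900), (5, 8200)) as t (n, v)
window w as (order by v)
order by v;
```

The trade-off is that the array width is hard-coded in the query instead of following the frame clause, so it does not give the same user-facing convenience as an aggregate.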

3 Comments

I really liked this, but it doesn't work when size_ is larger than the amount of input. Consider select n, v, framer(v, 3) over (order by v rows between 1 preceding and 1 following) arr from (values (1, 32), (2, 24)) as v (n, v) order by v; should return {null, 24, 32}, {24, 32, null} but instead returns {null, 24, 32}, {null, 24, 32}. Postgres is calling framer_msfunc twice to build the first result, and never again, reusing the answer for the second result. Because Postgres is willing to cache results, I wonder if there might be any other trick cases where it breaks.
Ugh... Can't find anything that would skip the cache with window aggregates... Seems like a custom window function written in C could be the solution though.
Upvoted! If you don't mind clarifying, what exactly is Postgres caching in moving aggregate functions?
