0

I'm currently learning UDFs and wrote the PostgreSQL UDF below to calculate the mean average deviation (MAD). It is the average absolute difference between the mean and the current value over any window. In python pandas/numpy, to find the MAD, we could write something like this:

series_mad = abs(series - series.mean()).mean()

Where series is a set of numbers and series_mad is a single numeric value representing the MAD of the series.

I'm trying to write this in PostgreSQL using Windows and UDF. So far, this is what I have:

CREATE TYPE misc_tuple AS (
    arr_store numeric[],
    ma_period integer
);

CREATE OR REPLACE FUNCTION mad_func(prev misc_tuple, curr numeric, ma_period integer)
    RETURNS misc_tuple AS $$
    BEGIN
        IF curr is null THEN
            RETURN (null::numeric[], -1);
        ELSEIF prev.arr_store is null THEN
            RETURN (ARRAY[curr]::numeric[], ma_period);
        ELSE
            -- accumulate new values in array
            prev.arr_store := array_append(prev.arr_store, curr);
            RETURN prev;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION mad_final(prev misc_tuple)
    RETURNS numeric AS $$
    DECLARE
        total_len integer;
        count numeric;
        mad_val numeric;
        mean_val numeric;
    BEGIN
        count := 0;
        mad_val := 0;
        mean_val := 0;
        total_len := array_length(prev.arr_store, 1);
        -- first loop to find the mean of the series
        FOR i IN greatest(1,total_len-prev.ma_period+1)..total_len
        LOOP 
            mean_val := mean_val + prev.arr_store[i];
            count := count + 1;
        END LOOP;
        mean_val := mean_val/NULLIF(count,0);
        -- second loop to subtract mean from each value 
        FOR i IN greatest(1,total_len-prev.ma_period+1)..total_len
        LOOP 
            mad_val := mad_val + abs(prev.arr_store[i]-mean_val);
        END LOOP;
        RETURN mad_val/NULLIF(count, 0);
    END;
    $$ LANGUAGE plpgsql;

CREATE OR REPLACE AGGREGATE mad(numeric, integer) (
    SFUNC = mad_func,
    STYPE = misc_tuple,
    FINALFUNC = mad_final
);

This is how I'm testing the performance:

-- find rolling 12-period MAD
SELECT x,
       mad(x, 12) OVER (ROWS 12-1 PRECEDING)
FROM generate_series(0,1000000) as g(x);

Currently, it takes ~45-50 secs on my desktop (i5 4670, 3.4 GHz, 16 GB RAM). I'm still learning UDFs, so I'm not sure what else I could do to my function to make it faster. I have a few other similar UDFs - but ones which don't use arrays and they take <15 secs on the same 1m rows. My guess is maybe I'm not efficiently looping the arrays or something could be done about the 2 loops in the UDF.

Are there any changes I can make to this UDF to make it faster?

4
  • Please provide sample test date as text, no images and the expected results. Commented Sep 5, 2020 at 20:34
  • @Belayer There are no images in my post. The question is not about the query's results, it's about the performance. Commented Sep 5, 2020 at 21:25
  • My request is for test data and the expected results. Request is not about what is in the question but what is not in it. Post that as text. Additionally descripe the problem you are trying to solve; not in just code. Commented Sep 5, 2020 at 22:48
  • @Belayer I'm sorry but I'm not sure how else I can clarify. There is no need for test data the generate_series already does that for you. The problem I'm trying to solve is to speed up the query I have mentioned in my post. I've also mentioned what the UDF calculates and my assumptions on why the code is slow. Since I'm just learning UDFs I was not sure what other techniques could be used to speed up my code and asked for suggestions - that is the question. Expected results are the same as what my code outputs but with much faster runtime. Commented Sep 6, 2020 at 7:07

1 Answer 1

1

Your example code does not work, you have an extra comma in the type definition, and you use an undefined variable cnt in one of the functions.

Why are you specifying 12 as both an argument to the aggregate itself, and in the ROWS PRECEDING? That seems redundant.

Your comparison to numpy doesn't seem very apt, as that is not a sliding window function.

I have a few other similar UDFs - but ones which don't use arrays and they take <15 secs on the same 1m rows

Are they also used as sliding window functions? Also written in plpgsql? Could you show one and it's usage?

pl/pgsql is not generally a very efficient language, especially not when manipulating large arrays. Although in your usage, the arrays never get very large, so I would not expect that to be particularly a problem.

One way to make it more efficient would be to write the code in C rather than pl/pgsql, using INTERNAL datatype rather than an SQL composite type.

Another way to improve this particular usage (a large number of windows each of which is small) might be to implement the MINVFUNC function and friends for this aggregate, so that it doesn't have to keep restarting the aggregation from scratch for every row.

Here is an example inverse function, which doesn't change the output at all, but does cut the run time in about half:

CREATE OR REPLACE FUNCTION mad_invfunc(prev misc_tuple, curr numeric, ma_period integer)
    RETURNS misc_tuple AS $$
    BEGIN
            -- remove prev value
            prev.arr_store := prev.arr_store[2:];
            RETURN prev;
    END;
    $$ LANGUAGE plpgsql;
CREATE OR REPLACE AGGREGATE mad(numeric, integer) (
    SFUNC = mad_func,
    STYPE = misc_tuple,
    FINALFUNC = mad_final,
    MSFUNC = mad_func,
    MSTYPE = misc_tuple,
    MFINALFUNC = mad_final,
    MINVFUNC = mad_invfunc
);

If I change the type from numeric to double precision everywhere they cuts the run time in half again. So while the loops over the array might not be terribly efficient, when using only 12-member windows they are not the main bottleneck.

Sign up to request clarification or add additional context in comments.

7 Comments

Sorry about the mistakes. I've fixed them now. -The reason I'm using 12 as an argument is because it's used to only select the last period values and then divide by period when finding out the mean. -It's not exactly numpy (sorry again for that), but rather the function I've mentioned above can be applied in pandas by using series.rolling(5).apply(<mean fn>, raw=True).
-The other functions do use sliding windows - I learned how to implement a simple SMA from this post. They are written in plpgsql but have no loops so it runs fast. Edit: the other functions I mentioned used MINVFUNC, but I'm not sure how I can use it with arrays.
Here's the link to the pandas implementation of the MAD statistic.
I've edited to include an example of inverse function.
The average function whose pl/pgsql definition you linked to is about 4 times slower than the built-in avg function is. I would expect a similar difference between a pl/pgsql and C implementation for this case. But making the C implementation would be a lot of work, especially if you haven't done C extensions to PostgreSQL before. And it will probably never be as fast as numpy, it it might make more sense to focus on a pipeline to move data back and forth.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.