PostgreSQL optimization: average over range of dates

Question

I have a query (with a subquery) that, for each day in a range, calculates the average temperature over all previous years within plus/minus one week of that day of the year. It works, but it is not all that fast. The time series values below are just an example. I'm using doy because I want a sliding window around the same date in every year.

SELECT days,
    (SELECT avg(temperature)
     FROM temperatures
     WHERE site_id = ? AND
      extract(doy FROM timestamp) BETWEEN
      extract(doy FROM days) - 7 AND extract(doy FROM days) + 7
    ) AS temperature
FROM generate_series('2017-05-01'::date, '2017-08-31'::date, interval '1 day') days

So my question is: could this query somehow be improved? I was thinking about using some kind of window function, or possibly lag and lead. However, regular window functions (at least) only work on a specific number of rows, whereas there can be any number of measurements within the two-week window.

I can live with what I have for now, but as the amount of data grows, so does the execution time of the query. The two latter extracts could be removed, but that gives no noticeable speed improvement and only makes the query less legible. Any help would be greatly appreciated.

Search for the term "sargable"; I also suggest providing an explain plan for your existing query.
– Paul Maxwell, Commented May 23, 2017 at 23:52

pozs · Accepted Answer · 2017-05-24 17:46:39Z

The best index for your original query is

create index idx_temperatures_site_id_timestamp_doy
  on temperatures(site_id, extract(doy from timestamp));

This can greatly improve your original query's performance.

While your original query is simple and readable, it has one flaw: it calculates every day's average 14 times (on average). Instead, you could aggregate per day first and then compute the two-week window's weighted average from those daily aggregates (the weight for a day's average must be the count of the individual rows for that day in your original table). Something like this:

with p as (
  -- report range (inclusive)
  select timestamp '2017-05-01' min,
         timestamp '2017-08-31' max
)
select     t.*
from       p
cross join (select     days,
                       -- weighted average: total of the daily sums over the
                       -- total of the daily row counts in the +/- 7 day frame
                       sum(sum(temperature)) over pn1week
                         / sum(count(temperature)) over pn1week as temperature
            from       p
            -- pad the series by one week on each side so edge days get a full frame
            cross join generate_series(min - interval '1 week', max + interval '1 week', interval '1 day') days
            left join  temperatures on site_id = ? and extract(doy from timestamp) = extract(doy from days)
            group by   days
            window     pn1week as (order by days rows between 7 preceding and 7 following)) t
where      days between min and max
order by   days

However, there is not much gain here, as this is only twice as fast as your original query (with the optimal index).

http://rextester.com/JCAG41071
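
The weighting used in the query above (sum of daily sums divided by sum of daily counts) can be checked with a small sketch outside the database. This is plain Python standing in for the SQL, not the query itself: the `readings` data, the function names, and the shortened window (`HALF_WINDOW = 1` instead of 7) are all made up for illustration.

```python
from statistics import mean

# Hypothetical readings keyed by day-of-year; days have different numbers of rows.
readings = {
    100: [10.0, 12.0, 14.0],
    101: [20.0],
    102: [],            # a day with no measurements
    103: [16.0, 18.0],
}
HALF_WINDOW = 1  # the answer uses 7; 1 keeps the example small


def direct_average(day):
    """Average every raw row in the window, like the original subquery."""
    rows = [t for d, temps in readings.items()
            if day - HALF_WINDOW <= d <= day + HALF_WINDOW
            for t in temps]
    return mean(rows) if rows else None


# Aggregate once per day (the GROUP BY step): (sum, row count) per day.
daily = {d: (sum(temps), len(temps)) for d, temps in readings.items()}


def windowed_average(day):
    """Sum of daily sums over sum of daily counts across the frame."""
    frame = [daily[d] for d in daily
             if day - HALF_WINDOW <= d <= day + HALF_WINDOW]
    total = sum(s for s, _ in frame)
    count = sum(c for _, c in frame)
    return total / count if count else None


def unweighted_average(day):
    """Plain mean of daily means: ignores how many rows each day has."""
    means = [s / c for d, (s, c) in daily.items()
             if c and day - HALF_WINDOW <= d <= day + HALF_WINDOW]
    return mean(means) if means else None


for day in sorted(readings):
    print(day, direct_average(day), windowed_average(day), unweighted_average(day))
```

For day 101 the first two functions agree on 14.0, while the unweighted mean of daily means gives 16.0, because it treats the single reading on day 101 as if it carried the same weight as the three readings on day 100.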

Notes: I used timestamp because I assumed your column's type is timestamp. But as it turned out, you use timestamptz (a.k.a. timestamp with time zone). With that type, you cannot index the extract(doy from timestamp) expression, because its output depends on the session's time zone setting, so it is not immutable.

For timestamptz use an index which (at least) starts with site_id. Using the window version should improve the performance anyway.

http://rextester.com/XTJSM42954
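
The time zone dependence mentioned in the notes can be illustrated with a short sketch. This is Python's `datetime`, not PostgreSQL, and the chosen instant and the two fixed-offset zones are assumptions picked only to make the effect visible: the same stored instant falls on a different day of the year depending on the zone it is viewed in.

```python
from datetime import datetime, timedelta, timezone

# One instant, shortly after midnight UTC on 1 January 2017.
instant = datetime(2017, 1, 1, 0, 30, tzinfo=timezone.utc)

# Two hypothetical session time zones, as fixed offsets to avoid needing tz data.
utc = timezone.utc
utc_minus_5 = timezone(timedelta(hours=-5))


def day_of_year(moment, tz):
    """Day of year of `moment` as seen from time zone `tz`."""
    return moment.astimezone(tz).timetuple().tm_yday


doy_utc = day_of_year(instant, utc)            # 1 January 2017
doy_west = day_of_year(instant, utc_minus_5)   # 31 December 2016
print(doy_utc, doy_west)
```

This prints `1 366`: the instant is day 1 of 2017 in UTC but day 366 of 2016 (a leap year) five hours further west. An index built on such a value would be wrong for any session using a different time zone, which is why the expression cannot be indexed on a timestamptz column.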

edited May 24, 2017 at 17:46

answered May 24, 2017 at 13:43

pozs


4 Comments

Teemu Karimerto Over a year ago

An interesting approach, and certainly much faster than my original one. My initial attempt was indeed indexing the table on doy but that does not work because apparently extract doy is not immutable. In any case, this works much much faster with the data I have.

pozs Over a year ago

@TeemuKarimerto that's because your column is actually timestamptz. Please see my edits (at notes).

Teemu Karimerto Over a year ago

Ah yes, that seems to be the issue with the indexing. I would prefer to use timestamp but these are all Django-generated tables and I'm not entirely sure how I ought to go about possibly converting the values in the database AND configuring Django so nothing breaks :D

Teemu Karimerto Over a year ago

It seems like trying to force Django to use timestamps without time zones is a bad idea. So I'm just going to skip the doy-based indexing and go with this query as it is certainly much faster than my original one.
