PostgreSQL 13: Efficient use of partitions and tsrange

Question

I am trying to figure out how to make the most efficient usage of table partitions and the tsrange function. I am using PostgreSQL 13.

I will post here an example knowing that it is not efficient and understanding why but I am looking for advice on how to make this efficient and tuned. In the example below all partitions are read every time I run the select query. I would like to find if a boat trip overlaps with another one in a booking calendar.

CREATE TABLE boat_trips (
   id INTEGER NOT NULL
   , boat_name VARCHAR(32)
   , departure_time TIMESTAMP WITHOUT TIME ZONE  NOT NULL
   , destination_time TIMESTAMP WITHOUT TIME ZONE NOT NULL
) 
PARTITION BY RANGE (departure_time);
                          
CREATE TABLE IF NOT EXISTS boat_trips_20210206 PARTITION OF boat_trips FOR VALUES FROM ('2021-02-06 00:00:00') TO ('2021-02-06 23:59:59');
CREATE TABLE IF NOT EXISTS boat_trips_20210207 PARTITION OF boat_trips FOR VALUES FROM ('2021-02-07 00:00:00') TO ('2021-02-07 23:59:59');
CREATE TABLE IF NOT EXISTS boat_trips_20210208 PARTITION OF boat_trips FOR VALUES FROM ('2021-02-08 00:00:00') TO ('2021-02-08 23:59:59');
                           
INSERT INTO boat_trips VALUES (1, 'The Beautiful', '2021-02-06 11:15:00'::TIMESTAMP, '2021-02-06 12:15:00'::TIMESTAMP);
INSERT INTO boat_trips VALUES (2, 'The Incredible', '2021-02-06 13:15:00'::TIMESTAMP, '2021-02-06 14:15:00'::TIMESTAMP);
INSERT INTO boat_trips VALUES (3, 'The Beautiful', '2021-02-06 12:30:00'::TIMESTAMP, '2021-02-06 13:15:00'::TIMESTAMP);
INSERT INTO boat_trips VALUES (4, 'The Beautiful', '2021-02-07 11:15:00'::TIMESTAMP, '2021-02-07 12:15:00'::TIMESTAMP);
INSERT INTO boat_trips VALUES (5, 'The Incredible', '2021-02-07 13:15:00'::TIMESTAMP, '2021-02-07 14:15:00'::TIMESTAMP);
INSERT INTO boat_trips VALUES (6, 'The Beautiful', '2021-02-07 12:30:00'::TIMESTAMP, '2021-02-07 13:15:00'::TIMESTAMP);
INSERT INTO boat_trips VALUES (7, 'The Beautiful', '2021-02-08 11:15:00'::TIMESTAMP, '2021-02-08 12:15:00'::TIMESTAMP);
INSERT INTO boat_trips VALUES (8, 'The Incredible', '2021-02-08 13:15:00'::TIMESTAMP, '2021-02-08 14:15:00'::TIMESTAMP);
INSERT INTO boat_trips VALUES (9, 'The Beautiful', '2021-02-08 12:30:00'::TIMESTAMP, '2021-02-08 13:15:00'::TIMESTAMP);

Then I run the select query as follows:

SELECT DISTINCT bt.id, bt.boat_name, bt.departure_time, bt.destination_time
FROM  boat_trips bt
WHERE  tsrange(departure_time, destination_time) &&
       tsrange '[2021-02-07 00:00:00,2021-02-08 00:00:00)';

If I get the explain plan, I find that all partitions are parsed and it is normal because the partition key id departure_time.

How could I take advantage of the partitioning and the tsrange function in my select query?

I tried to partion on tsrange(departure_time, destination_time) which works but then I can't find the syntax to create the partitions.

There is an example here => db fiddle

Why do you want to implement partitioning? This isn't primarily a performance thing, but more a "data management" tool, to e.g. get rid of old rows quickly (by dropping a partition rather than DELETEing million of rows). How many rows do you expect in that table? Why doesn't a non-partitioned table work with an index on the tsrange? — user330315
– user330315, Commented Feb 7, 2021 at 8:44
It is because at the end of the day the table(s) will contain several million rows and the queries will mostly be querying the current day or the very few days in the future. But for reporting purpose we need to keep the history available. — Pierre H.
– Pierre H., Commented Feb 7, 2021 at 8:50
"Several million rows" is not enough to justify partitioning. This is considered a small table these days. — user330315
– user330315, Commented Feb 7, 2021 at 9:19
I understand your point of view and you are in some ways right. I am used to manage databases with tables growing beyond the billion of rows. This is not about "mine is bigger than yours" but rather to also take into account the size of the index and to avoid a mammoth table with a mammoth index. Partitioning doesn't hurt especially with relatively "big" tables. The number of rows is by the way rather irrelevant when discussing disk space. What about the number of columns and their content (blob?). So my example was simplified for the question. In reality I don't use 3 columns and boats ;) — Pierre H.
– Pierre H., Commented Feb 7, 2021 at 9:44
FOR VALUES FROM ('2021-02-06 00:00:00') TO ('2021-02-06 23:59:59'); The ending value should be the same as the starting value of the next partition, as it is tested with the equivalent of <, not <=. Currently you have a gap of one second each day where no partition will qualify, which may cause a mysterious and painful problem at some point, — jjanes
– jjanes, Commented Feb 7, 2021 at 18:34

bobflux · Accepted Answer · 2021-02-07 08:40:40Z

For scalar values like integers, there is an order relationship, so if you have a limit value L, you can decide any value x<L goes into partition foo, and any value x>=L goes into partition bar.

Ranges don't have an order relationship that can be used to make partitions...

Say you have ranges [x1,x2] and [y1,y2], in order to sort them into partitions, you must order them, which means define an operator "<=" which has the required properties:

a ≤ a (reflexivity)
if a ≤ b and b ≤ a then a = b (antisymmetry)
if a ≤ b and b ≤ c then a ≤ c (transitivity).

Postgres has such an operator for ranges, this is what it uses if you make a btree index on a range. Can't make a btree without an order operation. But this operator doesn't result in an order that is meaningful for ranges, because for ranges [x1,x2] <= [y1,y2] just compares the tuples (x1,x2) <= (y1,y2), in other words, it returns the result of the comparison of x1 and y1, and if equal, the result of the comparison of x2 and y2.

If you have a set of non-overlapping ranges, then you can define a proper order relation. But your boat trip departure and arrival times are not a non-overlapping set

This means, no matter the limit value L between partitions, there may be a range in the previous partition whose length is enough to make it overlap with ranges in the next partition.

However, if you know that no trip takes more than, say, a month... then no interval will be longer than a month. So if you partition by month, you know that some ranges in partition N-1 might overlap with ranges in partition N, but no range in partition N-1 will overlap with any range in partition N+1.

So, if you put a constraint on each partition to ensure the upper bound of each range can't be above a certain value, like that:

CHECK( destination_time < ... )

then you can add this to your query:

WHERE tsrange(departure_time, destination_time) && tsrange '[2021-02-07 00:00:00,2021-02-08 00:00:00)' AND destination_time < '2021-02-08 00:00:00';

constraint exclusion should eliminate the partitions that won't need to be scanned.

Note that range overlap tests can be accelerated rather massively with a GIST index. If your only reason to use partitions is performance, this may be a better solution. If you want to use partitions to manage a huge quantity of data and make deletion faster, then it could be optimal to just use a few big partitions and a gist index. A range query will hit the gist index on all partitions, but it shouldn't take more than a couple tens of microseconds per partition if there are no matching ranges in it.

Thank you very much for your clear and accurate answer. It is what I was looking for. I have taken your remark about the GIST index into account and it improves the performances dramatically.

Collectives™ on Stack Overflow

PostgreSQL 13: Efficient use of partitions and tsrange

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related