
In the following link, the creator of a tool I use (Airflow) suggests creating partitions for daily snapshots of dimension tables. I am wondering about the overhead of doing something like this in Postgres.

I am using the Postgres 10 built-in partitioning for several tables, but mostly at a monthly or yearly level for facts. I have never tried implementing daily partitions for dimensions before, and it seems scary. It would simplify things for me in several areas, though, in case I need to rerun old tasks.

https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a

Simple. With dimension snapshots where a new partition is appended at each ETL schedule. The dimension table becomes a collection of dimension snapshots where each partition contains the full dimension as-of a point in time. “But only a small percentage of the data changes every day, that’s a lot of data duplication!”. That’s right, though typically dimension tables are negligible in size in proportion to facts. It’s also an elegant way to solve the SCD-type problem through its simplicity and reproducibility. Now that storage and compute are dirt cheap compared to engineering time, snapshotting dimensions makes sense in most cases.

While the traditional type-2 slowly changing dimension approach is conceptually sound and may be more computationally efficient overall, it’s cumbersome to manage. The processes around this approach, like managing surrogate keys on dimensions and performing surrogate key lookup when loading facts, are error-prone, full of mutations and hardly reproducible.
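To make the scenario concrete, here is a minimal sketch of what such a daily dimension snapshot could look like with the Postgres 10 built-in partitioning. The table and column names are hypothetical, not anything from the article:

```sql
-- Hypothetical dimension table, partitioned by snapshot day.
CREATE TABLE dim_customer_snapshot (
    snapshot_date date   NOT NULL,  -- the "as-of" day of this snapshot
    customer_id   bigint NOT NULL,  -- natural key from the source system
    name          text,
    segment       text
) PARTITION BY RANGE (snapshot_date);

-- Each daily ETL run appends one new partition holding the full
-- dimension as of that day.
CREATE TABLE dim_customer_snapshot_2018_01_15
    PARTITION OF dim_customer_snapshot
    FOR VALUES FROM ('2018-01-15') TO ('2018-01-16');

-- A rerun of an old task can simply truncate and reload its own partition:
-- TRUNCATE dim_customer_snapshot_2018_01_15;
-- INSERT INTO dim_customer_snapshot
--     SELECT DATE '2018-01-15', customer_id, name, segment FROM source_dim;
```

Reruns touching only their own partition is what makes the approach reproducible: each partition is an immutable, self-contained snapshot.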

1 Answer


I have worked with systems with different levels of partitioning. Generally, any partitioning is OK as long as you have check constraints on the partitions that allow the query planner to find the adequate partitions for a query (or you query a specific partition directly for some special cases). Otherwise you will see sequential scans over all partitions even for simple queries.
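For example, with classic PG inheritance the check constraints could look like this (a minimal sketch with hypothetical table names):

```sql
-- Parent table; children are attached via inheritance.
CREATE TABLE events (day date NOT NULL, payload text);

-- One child per day, with a CHECK constraint describing its bounds.
CREATE TABLE events_2018_01_15 (
    CHECK (day >= DATE '2018-01-15' AND day < DATE '2018-01-16')
) INHERITS (events);

-- With constraint_exclusion at its default ('partition'), the planner
-- can skip every child whose CHECK contradicts the WHERE clause:
EXPLAIN SELECT * FROM events WHERE day = DATE '2018-01-15';

-- For the special cases, bypass the parent and query one child directly:
SELECT * FROM events_2018_01_15;
```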

Daily partitions are completely OK, do not worry. I even worked with a PG-based data collector that needed a partition for every 5 minutes of data, because it collected several TB per day.

The number of partitions only becomes a bigger problem when you reach several thousand or tens of thousands of partitions - at that scale everything moves to a different level of problems.

You will have to set a proper max_locks_per_transaction, for example, to be able to work with them, because even a simple SELECT over the parent table places an AccessShareLock on all partitions - which is not exactly nice, but PG inheritance works this way.
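As a rough sketch of the adjustment (the value 1024 is only an illustrative guess, not a recommendation):

```sql
-- The shared lock table holds roughly
--   max_locks_per_transaction * (max_connections + max_prepared_transactions)
-- entries, and a SELECT on the parent takes one AccessShareLock per
-- partition, so thousands of partitions quickly exceed the default of 64.
-- Changing this parameter requires a server restart.
ALTER SYSTEM SET max_locks_per_transaction = 1024;

-- Verify the current setting:
SHOW max_locks_per_transaction;
```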

Plus higher planning time for queries - in our data warehouse we sometimes see planning take several minutes for queries that then execute in only seconds, which is rather painful. But it is hard to do anything about it, because the current PG planner works this way.
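If you want to measure how much planning costs on your own partitioned tables, EXPLAIN ANALYZE reports the two phases separately (the query below reuses the hypothetical events table from above):

```sql
-- Output ends with two separate timings, e.g.
--   Planning time:  1234.567 ms
--   Execution time:   89.012 ms
EXPLAIN ANALYZE
SELECT count(*) FROM events WHERE day = DATE '2018-01-15';
```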

But the pros still outweigh the cons, so I highly recommend using whatever partitioning granularity you need.
