In the following link, the creator of a tool I use (Airflow) suggests creating partitions for daily snapshots of dimension tables. I am wondering about the overhead of doing something like this in Postgres.
I am using the Postgres 10 built-in partitioning for several tables, but mostly at a monthly or yearly level for facts. I have never tried daily partitions for dimensions before, and it seems scary. It would simplify things for me in several areas, though, if I ever need to rerun old tasks.
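For concreteness, here is a minimal sketch of what I have in mind, using Postgres 10 declarative partitioning; the table and column names are hypothetical:

```sql
-- Hypothetical customer dimension, range-partitioned by snapshot date
-- (no primary key, since Postgres 10 does not support primary keys
-- on partitioned tables).
CREATE TABLE dim_customer (
    snapshot_date date   NOT NULL,
    customer_id   bigint NOT NULL,
    name          text,
    segment       text
) PARTITION BY RANGE (snapshot_date);
```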
The idea is simple: with dimension snapshots, a new partition is appended at each ETL schedule, and the dimension table becomes a collection of snapshots where each partition contains the full dimension as of a point in time. “But only a small percentage of the data changes every day, that’s a lot of data duplication!” That’s right, though dimension tables are typically negligible in size in proportion to facts. It’s also an elegant way to solve the slowly-changing-dimension problem, through sheer simplicity and reproducibility. Now that storage and compute are dirt cheap compared to engineering time, snapshotting dimensions makes sense in most cases.
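Each ETL run would then append one partition holding that day's full copy of the dimension. A sketch, again with hypothetical names (source_customer stands in for whatever staging table the ETL produces):

```sql
-- One partition per ETL run: create the day's partition, then load
-- the full dimension as of that day. Inserting through the parent
-- routes rows to the matching partition.
CREATE TABLE dim_customer_2018_06_01 PARTITION OF dim_customer
    FOR VALUES FROM ('2018-06-01') TO ('2018-06-02');

INSERT INTO dim_customer (snapshot_date, customer_id, name, segment)
SELECT DATE '2018-06-01', customer_id, name, segment
FROM   source_customer;  -- hypothetical staging table for this run
```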
While the traditional type-2 slowly changing dimension approach is conceptually sound and may be more computationally efficient overall, it is cumbersome to manage. The processes around it, such as managing surrogate keys on dimensions and performing surrogate key lookups when loading facts, are error-prone, full of mutations, and hard to reproduce.
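If I understand the pattern correctly, queries then need no surrogate keys at all: a fact row joins to the snapshot taken on its own date, using the natural key plus the snapshot date. A sketch with a hypothetical fact_orders table:

```sql
-- Dimension attributes "as of" a given day, with no surrogate-key
-- lookup. With a constant date, Postgres 10's constraint exclusion
-- scans only that day's snapshot partition.
SELECT f.order_id, f.amount, d.segment
FROM   fact_orders f
JOIN   dim_customer d ON d.customer_id = f.customer_id
WHERE  f.order_date    = DATE '2018-06-01'
  AND  d.snapshot_date = DATE '2018-06-01';
```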