@@ -446,6 +446,160 @@ rows = (outer_cardinality * inner_cardinality) * selectivity
446446 in <filename>src/backend/utils/adt/selfuncs.c</filename>.
447447 </para>
448448
449+ <sect2 id="functional-dependencies">
450+ <title>Functional Dependencies</title>
451+
452+ <para>
453+ The simplest type of extended statistics are functional dependencies,
454+ used in definitions of database normal forms. When simplified, saying that
455+ <literal>b</> is functionally dependent on <literal>a</> means that
456+ knowledge of value of <literal>a</> is sufficient to determine value of
457+ <literal>b</>.
458+ </para>
459+
460+ <para>
461+ In normalized databases, only functional dependencies on primary keys
462+ and superkeys are allowed. However, in practice, many data sets are not
463+ fully normalized, for example, due to intentional denormalization for
464+ performance reasons.
465+ </para>
466+
467+ <para>
468+ Functional dependencies directly affect accuracy of the estimates, as
469+ conditions on the dependent column(s) do not restrict the result set,
470+ resulting in underestimates.
471+ </para>
472+
473+ <para>
474+ To inform the planner about the functional dependencies, we collect
475+ measurements of dependency during <command>ANALYZE</>. Assessing
476+ dependency between all sets of columns would be prohibitively
477+ expensive, so we limit our search to potential dependencies defined
478+ using the <command>CREATE STATISTICS</> command.
479+
480+ <programlisting>
481+ CREATE TABLE t (a INT, b INT);
482+ INSERT INTO t SELECT i/100, i/100 FROM generate_series(1,10000) s(i);
483+ CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t;
484+ ANALYZE t;
485+ EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
486+ QUERY PLAN
487+ -------------------------------------------------------------------------------------------------
488+ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=0.095..3.118 rows=100 loops=1)
489+ Filter: ((a = 1) AND (b = 1))
490+ Rows Removed by Filter: 9900
491+ Planning time: 0.367 ms
492+ Execution time: 3.380 ms
493+ (5 rows)
494+ </programlisting>
495+
496+ The planner is now aware of the functional dependencies and considers
497+ them when computing the selectivity of the second condition. Running
498+ the query without the statistics would lead to quite different estimates.
499+
500+ <programlisting>
501+ DROP STATISTICS s1;
502+ EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
503+ QUERY PLAN
504+ -----------------------------------------------------------------------------------------------
505+ Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual time=0.000..6.379 rows=100 loops=1)
506+ Filter: ((a = 1) AND (b = 1))
507+ Rows Removed by Filter: 9900
508+ Planning time: 0.000 ms
509+ Execution time: 6.379 ms
510+ (5 rows)
511+ </programlisting>
512+ </para>
513+
514+ <para>
515+ If no dependency exists, the collected statistics do not influence the
516+ query plan. The only effect is to slow down <command>ANALYZE</>. Should
517+ partial dependencies exist these will also be stored and applied
518+ during planning.
519+ </para>
520+
521+ <para>
522+ Similarly to per-column statistics, extended statistics are stored in
523+ a system catalog called <structname>pg_statistic_ext</structname>, but
524+ there is also a more convenient view <structname>pg_stats_ext</structname>.
525+ To inspect the statistics <literal>s1</literal> defined above,
526+ you may do this:
527+
528+ <programlisting>
529+ SELECT tablename, staname, attnums, depsbytes
530+ FROM pg_stats_ext WHERE staname = 's1';
531+ tablename | staname | attnums | depsbytes
532+ -----------+---------+---------+-----------
533+ t | s1 | 1 2 | 40
534+ (1 row)
535+ </programlisting>
536+
537+ This shows that the statistics are defined on table <structname>t</>,
538+ <structfield>attnums</structfield> lists attribute numbers of columns
539+ (references <structname>pg_attribute</structname>). It also shows
540+ the length in bytes of the functional dependencies, as found by
541+ <command>ANALYZE</> when serialized into a <literal>bytea</> column.
542+ </para>
543+
544+ <para>
545+ When computing the selectivity, the planner inspects all conditions and
546+ attempts to identify which conditions are already implied by other
547+ conditions. The selectivity estimates from any redundant conditions are
548+ ignored from a selectivity point of view. In the example query above,
549+ the selectivity estimates for either of the conditions may be eliminated,
550+ thus improving the overall estimate.
551+ </para>
552+
553+ <sect3 id="functional-dependencies-limitations">
554+ <title>Limitations of functional dependencies</title>
555+
556+ <para>
557+ Functional dependencies are a very simple type of statistics, and
558+ as such have several limitations. The first limitation is that they
559+ only work with simple equality conditions, comparing columns and constant
560+ values. It's not possible to use them to eliminate equality conditions
561+ comparing two columns or a column to an expression, range clauses,
562+ <literal>LIKE</> or any other type of conditions.
563+ </para>
564+
565+ <para>
566+ When eliminating the implied conditions, the planner assumes that the
567+ conditions are compatible. Consider the following example, violating
568+ this assumption:
569+
570+ <programlisting>
571+ EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 10;
572+ QUERY PLAN
573+ -----------------------------------------------------------------------------------------------
574+ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=2.992..2.992 rows=0 loops=1)
575+ Filter: ((a = 1) AND (b = 10))
576+ Rows Removed by Filter: 10000
577+ Planning time: 0.232 ms
578+ Execution time: 3.033 ms
579+ (5 rows)
580+ </programlisting>
581+
582+ While there are no rows with such combination of values, the planner
583+ is unable to verify whether the values match - it only knows that
584+ the columns are functionally dependent.
585+ </para>
586+
587+ <para>
588+ This assumption is more about queries executed on the database - in many
589+ cases, it's actually satisfied (e.g. when the GUI only allows selecting
590+ compatible values). But if that's not the case, functional dependencies
591+ may not be a viable option.
592+ </para>
593+
594+ <para>
595+ For additional information about functional dependencies, see
596+ <filename>src/backend/statistics/README.dependencies</>.
597+ </para>
598+
599+ </sect3>
600+
601+ </sect2>
602+
449603 </sect1>
450604
451605</chapter>
0 commit comments