@@ -906,6 +906,8 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
906906 <secondary>of the planner</secondary>
907907 </indexterm>
908908
909+ <sect2>
910+ <title>Single-Column Statistics</title>
909911 <para>
910912 As we saw in the previous section, the query planner needs to estimate
911913 the number of rows retrieved by a query in order to make good choices
@@ -1043,7 +1045,200 @@ WHERE tablename = 'road';
10431045 Further details about the planner's use of statistics can be found in
10441046 <xref linkend="planner-stats-details">.
10451047 </para>
1048+ </sect2>
1049+
1050+ <sect2 id="planner-stats-extended">
1051+ <title>Extended Statistics</title>
1052+
1053+ <indexterm zone="planner-stats-extended">
1054+ <primary>statistics</primary>
1055+ <secondary>of the planner</secondary>
1056+ </indexterm>
10461057
1058+ <indexterm>
1059+ <primary>correlation</primary>
1060+ <secondary>in the query planner</secondary>
1061+ </indexterm>
1062+
1063+ <indexterm>
1064+ <primary>pg_statistic_ext</primary>
1065+ </indexterm>
1066+
1067+ <para>
1068+ It is common to see slow queries running bad execution plans because
1069+ multiple columns used in the query clauses are correlated.
1070+ The planner normally assumes that multiple conditions
1071+ are independent of each other,
1072+ an assumption that does not hold when column values are correlated.
1073+ Regular statistics, because of their per-individual-column nature,
1074+ do not capture the knowledge of cross-column correlation;
1075+ <firstterm>multivariate statistics</firstterm> can be used to instruct
1076+ the server to obtain statistics across such a set of columns,
1077+ which are later used by the query optimizer
1078+ to determine cardinality and selectivity
1079+ of clauses involving those columns.
1080+ Multivariate statistics are currently the only use of
1081+ <firstterm>extended statistics</firstterm>.
1082+ </para>
1083+
1084+ <para>
1085+ Extended statistics are created using
1086+ <xref linkend="sql-createstatistics">, which see for more details.
1087+ Data collection is deferred until the next <command>ANALYZE</command>
1088+ on the table, after which the stored values can be examined in the
1089+ <link linkend="catalog-pg-statistic-ext"><structname>pg_statistic_ext</structname></link>
1090+ catalog.
1091+ </para>
1092+
1093+ <para>
1094+ The following subsections describe the types of extended statistics
1095+ that are currently supported.
1096+ </para>
1097+
1098+ <sect3>
1099+ <title>Functional Dependencies</title>
1100+
1101+ <para>
1102+ The simplest type of extended statistics are functional dependencies,
1103+ a concept used in definitions of database normal forms.
1104+ Put simply, it is said that column <literal>b</> is functionally
1105+ dependent on column <literal>a</> if knowledge of the value of
1106+ <literal>a</> is sufficient to determine the value of <literal>b</>.
1107+ In normalized databases, functional dependencies are allowed only on
1108+ primary keys and superkeys. However, many data sets are in practice not
1109+ fully normalized for various reasons; intentional denormalization for
1110+ performance reasons is a common example.
1111+ </para>
1112+
1113+ <para>
1114+ The existance of functional dependencies directly affects the accuracy
1115+ of estimates in certain queries.
1116+ The reason is that conditions on the dependent columns do not
1117+ restrict the result set, but the query planner (lacking functional
1118+ dependency knowledge) considers them independent, resulting in
1119+ underestimates.
1120+ To inform the planner about the functional dependencies, we collect
1121+ measurements of dependency during <command>ANALYZE</>. Assessing
1122+ the degree of dependency between all sets of columns would be
1123+ prohibitively expensive, so the search is limited to potential
1124+ dependencies defined using the <literal>dependencies</> option of
1125+ extended statistics. It is advisable to create
1126+ <literal>dependencies</> statistics if and only if functional
1127+ dependencies actually exist, to avoid unnecessary overhead on both
1128+ <command>ANALYZE</> and query planning.
1129+ </para>
1130+
1131+ <para>
1132+ To inspect functional dependencies on a statistics
1133+ <literal>stts</literal>, you may do this:
1134+ <programlisting>
1135+ CREATE STATISTICS stts WITH (dependencies)
1136+ ON (zip, city) FROM zipcodes;
1137+ ANALYZE zipcodes;
1138+ SELECT stxname, stxkeys, stxdependencies
1139+ FROM pg_statistic_ext
1140+ WHERE stxname = 'stts';
1141+ stxname | stxkeys | stxdependencies
1142+ ---------+---------+--------------------------------------------
1143+ stts | 1 5 | [{1 => 5 : 1.000000}, {5 => 1 : 0.423130}]
1144+ (1 row)
1145+ </programlisting>
1146+ where it can be seen that column 1 (a zip code) fully determines column
1147+ 5 (city) so the coefficient is 1.0, while city only determines zip code
1148+ about 42% of the time, meaning that there are many cities (58%) that are
1149+ represented by more than a single ZIP code.
1150+ </para>
1151+
1152+ <para>
1153+ When computing the selectivity, the planner inspects all conditions and
1154+ attempts to identify which conditions are already implied by other
1155+ conditions. The selectivity estimates from any redundant conditions are
1156+ ignored from a selectivity point of view. In the example query above,
1157+ the selectivity estimates for either of the conditions may be eliminated,
1158+ thus improving the overall estimate.
1159+ </para>
1160+
1161+ <sect4>
1162+ <title>Limitations of Functional Dependencies</title>
1163+
1164+ <para>
1165+ Functional dependencies are a very simple type of statistics, and
1166+ as such have several limitations. The first limitation is that they
1167+ only work with simple equality conditions, comparing columns and constant
1168+ values. It's not possible to use them to eliminate equality conditions
1169+ comparing two columns or a column to an expression, range clauses,
1170+ <literal>LIKE</> or any other type of conditions.
1171+ </para>
1172+
1173+ <para>
1174+ When eliminating the implied conditions, the planner assumes that the
1175+ conditions are compatible. Consider the following example, where
1176+ this assumption does not hold:
1177+
1178+ <programlisting>
1179+ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
1180+ QUERY PLAN
1181+ -----------------------------------------------------------------------------
1182+ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=0 loops=1)
1183+ Filter: ((a = 1) AND (b = 10))
1184+ Rows Removed by Filter: 10000
1185+ </programlisting>
1186+
1187+ While there are no rows with such combination of values, the planner
1188+ is unable to verify whether the values match — it only knows that
1189+ the columns are functionally dependent.
1190+ </para>
1191+
1192+ <para>
1193+ This assumption is related to queries executed on the database; in many
1194+ cases, it's actually satisfied (e.g. when the GUI only allows selecting
1195+ compatible values). But if that's not the case, functional dependencies
1196+ may not be a viable option.
1197+ </para>
1198+ </sect4>
1199+ </sect3>
1200+
1201+ <sect3>
1202+ <title>Multivariate N-Distinct Coefficients</title>
1203+
1204+ <para>
1205+ Single-column statistics store the number of distinct values in each
1206+ column. Estimates of the number of distinct values on more than one
1207+ column (for example, for <literal>GROUP BY a, b</literal>) are
1208+ frequently wrong when the planner only has single-column statistical
1209+ data, however, causing it to select bad plans.
1210+ In order to improve n-distinct estimation when multiple columns are
1211+ grouped together, the <literal>ndistinct</> option of extended statistics
1212+ can be used, which instructs <command>ANALYZE</> to collect n-distinct
1213+ estimates for all possible combinations of two or more columns of the set
1214+ of columns in the statistics object (the per-column estimates are already
1215+ available in <structname>pg_statistic</>).
1216+ </para>
1217+
1218+ <para>
1219+ Continuing the above example, the n-distinct coefficients in a ZIP
1220+ code table may look like the following:
1221+ <programlisting>
1222+ CREATE STATISTICS stts2 WITH (ndistinct)
1223+ ON (zip, state, city) FROM zipcodes;
1224+ ANALYZE zipcodes;
1225+ SELECT stxkeys AS k, stxndistinct AS nd
1226+ FROM pg_statistic_ext
1227+ WHERE stxname = 'stts2';
1228+ -[ RECORD 1 ]---------------------------------------------
1229+ k | 1 2 5
1230+ nd | [{(b 1 2), 33178.000000}, {(b 1 5), 33178.000000},
1231+ {(b 2 5), 27435.000000}, {(b 1 2 5), 33178.000000}]
1232+ (1 row)
1233+ </programlisting>
1234+ which indicates that there are three combinations of columns that
1235+ have 33178 distinct values: ZIP code and state; ZIP code and city;
1236+ and ZIP code, city and state (the fact that they are all equal is
1237+ expected given the nature of ZIP-code data). On the other hand,
1238+ the combination of city and state only has 27435 distinct values.
1239+ </para>
1240+ </sect3>
1241+ </sect2>
10471242 </sect1>
10481243
10491244 <sect1 id="explicit-joins">
0 commit comments