
We have a table with 6M records. It contains hierarchical entities, i.e. each entity has a link to its parent via a parent_id field.

95% of rows are top-level entities, i.e. they have an empty parent_id field (empty string).

The recursive query for a single entity (to get all of its children) has a performance issue: the query planner badly misestimates the number of children and prefers a seq scan over an index scan on the parent_id field. I think the planner works this way because we have an uneven distribution of parent_id values. pg_stats shows most_common_vals: '' and n_distinct: 73 (in fact, almost all child entities, i.e. the 5% of the table that are not top-level, have distinct parent_id values).

It seems like the analyzer does not sample enough rows. However, when we change the empty string values to NULL, everything gets better: the planner uses the index.

I'm not an expert, and I'd like to know whether this trick with NULL is common practice, or whether it's just an accident and the ANALYZE process could spoil the statistics over time, bringing us back to slow performance.
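
For context, here is a minimal sketch of the kind of schema and recursive query described above. The table, column types, and index names are assumptions for illustration only; the actual definitions are not shown in the question.

-- Hypothetical schema and query, only to illustrate the setup described above.
CREATE TABLE entity (
    id        text PRIMARY KEY,
    parent_id text NOT NULL DEFAULT '',  -- '' marks top-level entities (~95% of rows)
    payload   text
);

CREATE INDEX entity_parent_id_idx ON entity (parent_id);

-- Fetch one entity and all of its descendants.
WITH RECURSIVE tree AS (
    SELECT * FROM entity WHERE id = 'some-entity-id'
    UNION ALL
    SELECT e.*
    FROM entity e
    JOIN tree t ON e.parent_id = t.id
)
SELECT * FROM tree;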

Comments

  • What empty string do you change for NULL? Are you talking about the data or the query? Can you show us your table definition, indices, and your queries? Commented Jan 15, 2021 at 10:56
  • I think the question is quite clear. Commented Jan 15, 2021 at 16:48

1 Answer


I guess what is happening to you is this:

CREATE TABLE rare (x integer);

INSERT INTO rare
SELECT CASE WHEN random() < 0.05
            THEN i
            ELSE 0
       END
FROM generate_series(1, 100000) AS i;

ANALYZE rare;

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'rare'
  AND attname = 'x';

 null_frac | n_distinct | most_common_vals | most_common_freqs 
-----------+------------+------------------+-------------------
         0 |       1471 | {0}              | {0.9526333}
(1 row)

SELECT count(DISTINCT x) FROM rare;

 count 
-------
  4904
(1 row)

Whereas with NULL values, it looks like this:

TRUNCATE rare;

INSERT INTO rare
SELECT CASE WHEN random() < 0.05
            THEN i
            ELSE NULL
       END
FROM generate_series(1, 100000) AS i;

ANALYZE rare;

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'rare'
  AND attname = 'x';

 null_frac  |  n_distinct  | most_common_vals | most_common_freqs 
------------+--------------+------------------+-------------------
 0.95093334 | -0.049066663 |                  | 
(1 row)

The difference is that in the second case, PostgreSQL realizes that the non-null values are pretty much unique (they actually are unique), so it represents the number of distinct values as a negative ratio: negative, to tell it apart from the normal case, and a ratio, because that remains more accurate if the data change.

PostgreSQL takes only a sample of the table to calculate the statistics, so the absolute count of distinct values (1471 in this example) is of course far off the mark, but the ratio of distinct values to the total row count is quite accurate.

In the first case, PostgreSQL would estimate

SELECT DISTINCT x FROM rare;

to have 1471 result rows, while in the second case it would estimate 100000 * 0.049066663 ≈ 4907 rows.
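One way to see these estimates directly (an illustrative command, not part of the original answer) is to ask the planner for its plan and look at the estimated row count on the aggregation node:

-- The "rows" estimate on the aggregate node reflects the n_distinct statistic.
EXPLAIN SELECT DISTINCT x FROM rare;

Run against each version of the table, the estimate should come out near 1471 in the first case and near 4907 in the second, mirroring the pg_stats numbers above.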

Now that explains what you see, but what can you do to improve the estimate?

The answer is to take a larger sample:

ALTER TABLE rare ALTER x SET STATISTICS 1000;

ANALYZE rare;

Then PostgreSQL will take a much larger sample and come up with a more accurate estimate.
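
As a hedged follow-up (not part of the original answer): after the larger ANALYZE, you can re-run the pg_stats query from above; with a per-column statistics target of 1000 on this 100,000-row table, n_distinct should land much closer to the true count of 4904.

-- Re-check the statistics after ANALYZE with the larger target.
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'rare'
  AND attname = 'x';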
