I guess what is happening to you is this:
CREATE TABLE rare (x integer);
INSERT INTO rare
SELECT CASE WHEN random() < 0.05
THEN i
ELSE 0
END
FROM generate_series(1, 100000) AS i;
ANALYZE rare;
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'rare'
AND attname = 'x';
null_frac | n_distinct | most_common_vals | most_common_freqs
-----------+------------+------------------+-------------------
0 | 1471 | {0} | {0.9526333}
(1 row)
SELECT count(DISTINCT x) FROM rare;
count
-------
4904
(1 row)
Whereas with NULL values, it looks like this:
TRUNCATE rare;
INSERT INTO rare
SELECT CASE WHEN random() < 0.05
THEN i
ELSE NULL
END
FROM generate_series(1, 100000) AS i;
ANALYZE rare;
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'rare'
AND attname = 'x';
null_frac | n_distinct | most_common_vals | most_common_freqs
------------+--------------+------------------+-------------------
0.95093334 | -0.049066663 | |
(1 row)
The difference is that in the second case, PostgreSQL realizes that the non-null values are pretty much unique (they actually are unique), so it represents the number of distinct values as the negative ratio. Negative, to tell it apart from the normal case, and ratio, because that remains more accurate if the data change.
PostgreSQL takes only a sample of the table to calculate the statistics, so the absolute count of distinct values (1471 in this example) is of course far off the mark, but the fraction (when calculated with the fraction size!) is quite accurate.
In the first case, PostgreSQL would estimate
SELECT DISTINCT x FROM rare;
to have 1471 result rows, while in the second case it would estimate 100000 * 0.049066663 ≈ 4907 rows.
Now that explains what you see, but what can you do to improve the estimate?
The answer is to take a larger sample:
ALTER TABLE rare ALTER x SET STATISTICS 1000;
ANALYZE rare;
Then PostgreSQL will take a much larger sample and come up with a more accurate estimate.
NULL? Are you talking about the data or the query? Can you show us your table definition, indices, and your queries?