
I have a table with ~35M rows, and I'm trying to find "processed" records to remove from time to time. There are 14 valid statuses, and 10 of them count as processed.

create table tbl (
    id uuid default uuid_generate_v4() not null primary key,
    fk_id uuid not null references fk_table,
    -- ... other columns
    created_date timestamptz default now() not null,
    status varchar(128) not null
);

Values for status can be one of a,b,c,d,e,f,g,h,i,j,k,l,m,n (14)

The index is on (status,created_date).
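(Presumably created with something like the following; the index name is my placeholder.)

create index tbl_status_created_idx on tbl (status, created_date);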
A query like:

select id from tbl
where created_date < 'somedate'
and status = ANY('{a,b,c,d,e,f,g,h,i,j}'); -- the first 10 values

The query planner insists on using a full sequential scan instead of the index.

Is there a trick to make Postgres use the index for the status = ANY part of the predicate?

5 Comments

  • Could you please share the results from explain(analyze, verbose, buffers, settings) and the complete DDL for all tables and indexes involved? All in plain text, as an update to the original question. Commented Jun 4, 2024 at 18:38
  • @FrankHeikens I will see what I can do. Commented Jun 4, 2024 at 19:50
  • Does it use the index if you use IN ( … ) syntax? Commented Jun 4, 2024 at 23:46
  • @Bergi no, it does not use the index in either case. Commented Jun 5, 2024 at 3:03
  • @Bergi the answer from Erwin Brandstetter has a great SQL fiddle demonstrating the behavior differences of the array. Commented Jun 5, 2024 at 18:47

2 Answers


It probably just thinks the seq scan will be faster, and for all we know it is correct about that. You can force it to try both plans by toggling the "enable_seqscan" setting and capturing EXPLAIN (ANALYZE, BUFFERS) output under each setting. That way we can see which plan is actually faster (run it several times each way to make sure the timings are consistent rather than a one-time fluke) and compare the estimated row counts against the actual row counts, to see whether they are discordant.
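For example (a sketch; 'somedate' is the question's placeholder):

SET enable_seqscan = off;  -- adds a huge cost penalty to seq scans; it does not forbid them
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM tbl
WHERE created_date < 'somedate'
AND status = ANY('{a,b,c,d,e,f,g,h,i,j}');
RESET enable_seqscan;

Run the same EXPLAIN with the default setting and compare the actual times and the estimated-vs-actual row counts on each plan node.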

If there is a correlation between the columns such that rows with those 10 statuses are disproportionately rare among the low created_date values, that skew can make it impossible to get good row count estimates. This type of skew is quite likely, given your description of how rows are removed from the table. And none of the currently implemented extended statistics types is likely to fix this type of estimation problem.

But regardless of that, if you build an index that supports an index-only scan, (status, created_date, id), the query should actually be faster and should also be estimated to be faster, so the planner will likely use that index even if the row estimates remain wrong. This is more likely to work than the partial index suggested by Erwin, because (alas) PostgreSQL does not use the size of partial indexes as part of its estimation process. So even though the partial index might be small, PostgreSQL will not use that knowledge to guide it into using the partial index.
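A sketch of that covering index (the name is illustrative):

CREATE INDEX tbl_status_created_id_idx ON tbl (status, created_date, id);

For the index-only scan to pay off, the table's visibility map must also be reasonably current (i.e. the table is vacuumed often enough); otherwise Postgres still has to visit the heap to check tuple visibility.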


5 Comments

I'm a recovering MSSQL developer. In that engine, all non-clustered indexes implicitly contain the clustered index key columns for that row as part of their definition. Are you saying that in Postgres the clustered index key is not implicitly part of the non-clustered index definition, and you have to include it explicitly? I am sure the non-clustered index would have to include at minimum the ctid, but I think what you're saying is that once the non-clustered index gets scanned, it yields a list of ctids, and it has to go back to the heap to exchange each ctid for the row. Right?
This is gold. Can you cite a source for "PostgreSQL does not use the size of partial indexes as part of its estimation process"? I believe MSSQL does care about the size of an index (pages or total size, I cannot remember which).
@JJS PostgreSQL doesn't have clustered indexes (at least in the sense MSSQL uses the term). All indexes are secondary. The primary key is not automatically part of every index, if you want to be able to retrieve the primary key column from the index you need to explicitly include it in the index. So yes, it otherwise needs to visit the table to "exchange" the ctid for the pk.
@JJS I don't think there is a source for it, just a lack of one. That is, there is no source code that uses the information, despite it being available. Unimplemented features usually aren't documented, unless it is needed to document a departure from the standard, which wouldn't apply here. The size of the index is used to estimate IO costs, but not row estimates.
Re "PostgreSQL doesn't have clustered indexes": thanks. I still have a LOT to learn about how PostgreSQL is different from MSSQL. Can you point me towards where it describes what purposes the primary key serves, and how it affects ordering of the heap?

If more than a few percent of rows qualify - or rather, if Postgres estimates as much - it will choose a sequential scan, which is faster in such a case.

If, in fact, only a few rows qualify, then your column statistics (and/or cost settings) are to blame for the misleading estimates.

If the cited index is only for the purpose at hand, and only relatively few rows have a "processed" state, replace it with a partial index:

CREATE INDEX foo ON tbl (created_date) WHERE status = ANY('{a,b,c,d,e,f,g,h,i,j}');

This would make the index much smaller, the query faster, and the likelihood that it gets used greater.
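One caveat: the planner uses a partial index only when it can prove that the query's WHERE clause implies the index predicate, so the query has to repeat the same (or a narrower) status condition as a literal:

SELECT id FROM tbl
WHERE created_date < 'somedate'
AND status = ANY('{a,b,c,d,e,f,g,h,i,j}');  -- matches the index predicate

If the array arrives as a parameter, the planner generally cannot prove the implication, which defeats the partial index (see the comments below).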

Either way, increasing the statistics target for created_date and status at least a bit will most probably help.
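A minimal sketch (the target of 1000 is an arbitrary example; the default is 100):

ALTER TABLE tbl ALTER COLUMN status SET STATISTICS 1000;
ALTER TABLE tbl ALTER COLUMN created_date SET STATISTICS 1000;
ANALYZE tbl;  -- gather fresh statistics at the new target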

More aggressive autovacuum settings for the table should help, too.
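For example, with per-table storage parameters (the thresholds are illustrative, not a recommendation from this answer):

ALTER TABLE tbl SET (
  autovacuum_vacuum_scale_factor  = 0.02,  -- default 0.2: vacuum after ~2% dead tuples
  autovacuum_analyze_scale_factor = 0.02   -- default 0.1: analyze after ~2% changed rows
);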

In any case, for only "14 valid states", status varchar(128) seems tremendously wasteful.
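A leaner alternative (my sketch, not part of the original answer) would be an enum, which stores each value in 4 bytes:

CREATE TYPE status_type AS ENUM ('a','b','c','d','e','f','g','h','i','j','k','l','m','n');
ALTER TABLE tbl ALTER COLUMN status TYPE status_type USING status::text::status_type;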

Also, the planner has gotten smarter about this in Postgres 16 as compared to Postgres 12. Postgres 16 detects common values even in input arrays with many elements (many more than you have distinct values in status) and switches the plan accordingly. I am not sure about the old logic in Postgres 12, but beyond some number of elements in the array, Postgres used to switch to generic estimates, which can produce poor results.

But note that either version can even adapt the plan for prepared statements, based on actual input.
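You can watch that adaptation with a prepared statement (a sketch; the interval is a placeholder):

PREPARE purge_scan (timestamptz, varchar[]) AS
SELECT id FROM tbl
WHERE created_date < $1
AND status = ANY($2);

EXPLAIN (ANALYZE) EXECUTE purge_scan(now() - interval '30 days', '{a,b,c,d,e,f,g,h,i,j}');

Postgres plans the first few executions with the actual parameter values (custom plans) before it considers a cached generic plan, so the chosen plan can change with the input.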

fiddle -- pg 16
fiddle -- pg 12

More depends on missing details ...

8 Comments

Can Postgres actually do the estimation if the compared value is an array? Does it know the array length and elements? What if the array value is passed as a parameter (not as a SQL literal in the query text)?
@Bergi the array value is passed as a parameter
@JJS That was my guess, and I fear that's one of the things that might affect the planner, so thanks for clarifying that detail. Being a dynamic subset of states certainly invalidates the idea of using a partial index.
@ErwinBrandstetter I would have thought that the predicates involved would have been able to use the index effectively. Is it because the generated query plan is generic and doesn't know the values of the parameters? Should I have included this in my original question? Re "replace it with a partial index": I tried this and didn't get any better results. I may have done something wrong. I will read the provided links.
Postgres can use estimates for array elements in the input, with some limitations. A prepared statement is more limited, but it can still adapt the plan to the input. Postgres 16 is a lot better at this than Postgres 12, though. I added a bit above