
I have a postgres 13.3 table that looks like the following:

CREATE TABLE public.enrollments (
    id bigint NOT NULL,
    portfolio_id bigint NOT NULL,
    consumer_id character varying(255) NOT NULL,
    identity_id character varying(255) NOT NULL,
    deleted_at timestamp(0) without time zone,
    batch_replace boolean DEFAULT false NOT NULL
);
CREATE UNIQUE INDEX enrollments_portfolio_id_consumer_id_index ON public.enrollments 
  USING btree (portfolio_id, consumer_id) WHERE (deleted_at IS NULL);

Each portfolio typically contains many millions of enrollments. My customers regularly send me a batch file that contains all of their enrollments, and I have to make the database match this file. I read a chunk of about 1000 rows at a time and then check which enrollments already exist with a query such as the following:

SELECT * FROM enrollments WHERE deleted_at IS NULL AND portfolio_id = 1 
  AND consumer_id = ANY(ARRAY['C1', 'C2', ..., 'C1000'])

It appears that for a new portfolio it doesn't use the unique partial index, so this query can take up to 30 seconds. When there are already several million enrollments in the portfolio, the index seems to work and the query takes around 20 ms. I've had to change the SQL to query one enrollment at a time, which takes about 1 second per 1000 rows. This isn't ideal, as it can take up to a day to finish a file, but at least it finishes.

Does anybody know what I can do to get the unique partial index to be used consistently when using many consumer_ids in the select?

Below is some EXPLAIN output. The slow query took a little over 4 seconds here, and this increases to at least 30 seconds as more and more enrollments are inserted into the portfolio, until at some point it drops to about 20 ms.

Existing enrollments in this portfolio: 78140485

Index Scan using enrollments_portfolio_id_consumer_id_index on enrollments e0  (cost=0.70..8637.14 rows=1344 width=75) (actual time=3.529..37.827 rows=1000 loops=1)
  Index Cond: ((portfolio_id = '59031'::bigint) AND ((consumer_id)::text = ANY ('{C1,C2,...,C1000}'::text[])))
  I/O Timings: read=27.280
Planning Time: 0.477 ms
Execution Time: 37.914 ms

Benchmark time: 20 ms


Existing enrollments in this portfolio: 136000

Index Scan using enrollments_portfolio_id_consumer_id_index on enrollments e0  (cost=0.70..8.87 rows=1 width=75) (actual time=76.615..4354.081 rows=1000 loops=1)
  Index Cond: (portfolio_id = '59028'::bigint)
  Filter: ((consumer_id)::text = ANY ('{C1,C2,...,C1000}'::text[]))
  Rows Removed by Filter: 135000
Planning Time: 1.188 ms
Execution Time: 4354.341 ms

Benchmark time: 4398 ms
  • Please edit your question and add both execution plans (the fast and the slow) generated using explain (analyze, buffers, format text) (not just a "simple" explain) as formatted text and make sure you preserve the indention of the plan. Paste the text, then put ``` on the line before the plan and on a line after the plan. Commented Oct 13, 2021 at 11:59
  • What column holds the most unique values, portfolio_id or consumer_id? Your index is optimised for a situation where portfolio_id holds the most unique values. Your query could benefit from an index where you first use consumer_id and second portfolio_id. But you have to check, and without a query plan it's just a guess from my side (see the index sketch after these comments). Commented Oct 13, 2021 at 12:09
  • Interesting question +1. Please include the execution plans. The optimizer may be trying to be "too smart". @FrankHeikens Thinking the same thing. Commented Oct 13, 2021 at 13:46
  • @FrankHeikens I may have the order wrong, this is definitely not my area of expertise. I thought I should use portfolio_id first as I also have queries that don't include the consumer_id for getting counts/portfolio and setting a mark flag at the beginning so I can delete the non-existing enrollments at the end. Commented Oct 13, 2021 at 15:51
  • Both queries are using the same index, enrollments_portfolio_id_consumer_id_index. Is that the index you created for this purpose? Because in your question you mention a different name. This is your problem: Rows Removed by Filter: 135000 Commented Oct 13, 2021 at 18:35
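
For reference, the index shape suggested in the second comment would look something like the following. The index name and column order here are illustrative only and would need to be validated against the real data, and against the other queries that filter on portfolio_id alone:

CREATE UNIQUE INDEX enrollments_consumer_id_portfolio_id_index
  ON public.enrollments USING btree (consumer_id, portfolio_id)
  WHERE (deleted_at IS NULL);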

1 Answer


The thing here that is actually slow is that =ANY is implemented by looping over the 1000 members of your array and testing each one, and doing that for each of the 136000 rows it needs to inspect. That is a lot of looping (but not 4 seconds' worth in my hands, "only" 1.5 s for me). Worse, the planner doesn't anticipate that the =ANY has such a poor implementation, so it sees no reason to choose the other plan to avoid it.

v14 will fix this by using a hash table to implement the =ANY, so it will no longer be so inordinately slow.

If you can't or don't want to upgrade to v14, you could rewrite the query by joining to a VALUES list rather than using =ANY:

SELECT * FROM enrollments JOIN (VALUES ('C1'),...,('C1000')) f(c) ON c = consumer_id
  WHERE deleted_at IS NULL AND portfolio_id = 1
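
If building a long VALUES list in the client is inconvenient, the same idea can also be expressed by joining against unnest() of the array. This is only a sketch against the question's schema (the array is shortened here for illustration), and the resulting plan is worth confirming with EXPLAIN on real data:

-- Join the array elements as a derived table instead of filtering with =ANY.
-- The array is shortened for illustration; in practice it would hold up to 1000 ids.
SELECT e.*
FROM enrollments e
JOIN unnest(ARRAY['C1', 'C2', 'C3']) AS f(consumer_id)
  ON f.consumer_id = e.consumer_id
WHERE e.deleted_at IS NULL
  AND e.portfolio_id = 1;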

2 Comments

  • Thanks! That sped things up quite a bit. At 1M rows, the query now takes about 1.3 seconds vs about 33 seconds for the old one. That's at least usable until the other plan kicks in. I'm still trying to figure out when that happens.
  • In my hands the cutoff is around 1200. But the thing is, it will not know you have exceeded that until after an analyze happens to get new stats.
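
If the delay before the plan changes is due to stale statistics after a big batch insert, a manual ANALYZE (rather than waiting for autoanalyze) should refresh the row estimates; a minimal sketch:

-- Refresh planner statistics so row estimates reflect the newly inserted enrollments.
ANALYZE public.enrollments;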
