0

An app that I'm working with runs a query like this:

SELECT
   "categories"."id" 
FROM
   "categories" 
WHERE
(
      ('f' ) 
      OR categories.id NOT IN 
      (
         SELECT
            category_id 
         FROM
            category_groups
      )
);

This query takes an extremely long time to finish, and I haven't seen it finish yet. Our "categories" table has 65947 rows and our "category_groups" table has 131,780 rows.

When running "EXPLAIN" on this query, it says that this query will "cost" a ton of work (cost=1000.29..109033834.49).

I edited this query to remove that ('f') OR section of the clause, as shown in the below query:

SELECT
   "categories"."id" 
FROM
   "categories" 
WHERE
(
      categories.id NOT IN 
      (
         SELECT
            category_id 
         FROM
            category_groups
      )
);

This query finishes very quickly, and it's "cost" when EXPLAIN-ing it is much lower (6283.94..10190.09).

Similarly, if I replace the ('f') OR with ('t') OR, the query completes quickly, and the cost goes down (back to 6283.94..10190.09).

Why would adding that ('f') OR clause damage this query's performance so much?

Edit: Here's the full EXPLAIN (VERBOSE) for the query with ('f') OR

 Gather  (cost=1000.29..109033834.49 rows=32952 width=4)
   Output: categories.id
   Workers Planned: 1
   ->  Parallel Index Only Scan using categories_pkey on public.categories  (cost=0.29..109029539.29 rows=19384 width=4)
         Output: categories.id
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.00..5295.43 rows=131762 width=4)
                 Output: category_groups.category_id
                 ->  Seq Scan on public.category_groups  (cost=0.00..4121.62 rows=131762 width=4)
                       Output: category_groups.category_id

And here's the full explain for the query without ('f') OR:

 Hash Anti Join  (cost=6283.94..10190.09 rows=1131 width=4)
   Output: categories.id
   Hash Cond: (categories.id = category_groups.category_id)
   ->  Index Only Scan using categories_pkey on public.categories  (cost=0.29..2213.44 rows=65903 width=4)
         Output: categories.id
   ->  Hash  (cost=4121.62..4121.62 rows=131762 width=4)
         Output: category_groups.category_id
         ->  Seq Scan on public.category_groups  (cost=0.00..4121.62 rows=131762 width=4)
               Output: category_groups.category_id
9
  • is category_groups.category_id nullable? Commented Mar 10, 2022 at 20:28
  • 1
    What's the intention behind writing where false or ...? But in general, OR conditions are hard to optimize (for any DBMS, not just Postgres) Commented Mar 10, 2022 at 20:28
  • category_groups.category_id is not nullable. @SalmanA Commented Mar 10, 2022 at 20:31
  • 2
    Can we see the full explain, please? What is the purpose of the ('f' ) in the query? Commented Mar 10, 2022 at 21:06
  • 2
    @BrianK I'm surprised it's that slow with only 32952 rows. Have you analyzed the tables? Note that where 'f' or X is just X. 'f', in boolean context, is false; the f does nothing but confuse the optimizer. Commented Mar 10, 2022 at 21:14

1 Answer 1

3

The FALSE OR prevents PostgreSQL from optimizing your NOT IN condition to an anti-join. The reason is that the PostgreSQL optimizer doesn't think hard enough for that. However, since it is easy to rewrite the query to do better (simple remove the FALSE OR), there is little reason to make the optimizer smarter. A smart optimizer is slow, and the gain of optimizing an ill-written query better does not outweigh the price that everyone would have to pay.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.