I have a table in which I want to search by a prefix of the primary key. The primary key has values like 03.000221.1, 03.000221.2, 03.000221.3, etc. and I want to retrieve all that begin with 03.000221..

My first thought was to filter with LIKE '03.000221.%', thinking Postgres would be smart enough to look up 03.000221. in the index and perform a range scan from that point. But no, this performs a sequential scan.

                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..253626.34 rows=78 width=669)
   Workers Planned: 2
   ->  Parallel Seq Scan on ...  (cost=0.00..252618.54 rows=32 width=669)
         Filter: (id ~~ '03.000221.%'::text)
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true

If I do an equivalent operation using a plain >= and < range, e.g. id >= '03.000221.' and id < '03.000221.Z', it does use the index:

                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ... on ...  (cost=0.56..8.58 rows=1 width=669)
   Index Cond: ((id >= '03.000221.'::text) AND (id < '03.000221.Z'::text))

But this is dirtier and it seems to me that Postgres should be able to deduce it can do an equivalent index range lookup with LIKE. Why doesn't it?

1 Answer

PostgreSQL will do this if you build the index with the text_pattern_ops operator class, or if you are using the C collation.
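As a minimal sketch of both options: the table name items and the index names here are hypothetical, since the question elides the real ones, and the column is assumed to be of type text as in the plans shown. Either index should let the planner turn a left-anchored LIKE into a range scan, though the plan you actually get depends on your statistics and server version.

```sql
-- Hypothetical names: "items", "items_id_pattern_idx", "items_id_c_idx".
-- Option 1: an index using the text_pattern_ops operator class,
-- which compares strings character by character regardless of collation.
CREATE INDEX items_id_pattern_idx ON items (id text_pattern_ops);

-- Option 2: an index that sorts the column under the C collation.
CREATE INDEX items_id_c_idx ON items (id COLLATE "C");

-- Check whether the prefix search now uses an index.
EXPLAIN SELECT * FROM items WHERE id LIKE '03.000221.%';
```

Note that the existing primary key index is still needed to enforce uniqueness, so either option adds a second index on the same column.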

If you are using some random other collation, PostgreSQL can't deduce much of anything about it. Observe this, in the very common "en_US.utf8" collation.

select * from (values ('03.000221.1'), ('03.0002212'), ('03.000221.3')) f(x) order by x;
      x      
-------------
 03.000221.1
 03.0002212
 03.000221.3

Which then naturally leads to this wrong answer with your query:

select * from (values ('03.000221.1'), ('03.0002212'), ('03.000221.3')) f(id)
    where ((id >= '03.000221.'::text) AND (id < '03.000221.Z'::text));
     id      
-------------
 03.000221.1
 03.0002212
 03.000221.3
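
For comparison, here is the same range test with the C collation forced explicitly on the comparisons. This is only a sketch to illustrate the difference: under C, strings compare byte by byte, and '2' sorts after '.', so 03.0002212 should fall outside the range.

```sql
-- Same values as above, but compared under the C collation.
select * from (values ('03.000221.1'), ('03.0002212'), ('03.000221.3')) f(id)
    where id collate "C" >= '03.000221.'
      and id collate "C" < '03.000221.Z'
    order by id collate "C";
```

Even under C, the upper bound '03.000221.Z' is only an approximation of the prefix: it would miss any key whose suffix starts with a character that sorts after 'Z', such as a lowercase letter.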

3 Comments

For the given values, collate "C" is probably the best choice
I'm using C.UTF-8, which apparently isn't C enough. Thanks!
@ToniCárdenas I've never understood the difference between C and C.UTF-8. I think maybe C is implemented internally as a special case, while C.UTF-8 is outsourced to glibc. It probably could use the index over C.UTF-8 and get the right answer, it just doesn't know that it could.
