Using Postgres, I want to perform a full text search that includes an array column, using an index. Let's start with a hypothetical schema:
CREATE TABLE book (title TEXT, tags TEXT[]);
-- tags are lowercase a-z, dashes, and $
We want a query that searches title and tags for some text. A straightforward query with reasonable semantics, recommended by many SO answers, is:
SELECT *
FROM book
WHERE to_tsvector('simple', array_to_string(tags, ' ')) || to_tsvector('simple', title)
@@ to_tsquery('simple', 'mysearchterm');
That works. The dashes and dollar signs in tags effectively no longer exist but that's ok for this application. However, we have millions of records and need an index:
CREATE INDEX book_fulltext_idx
ON book using GIN
((to_tsvector('simple', array_to_string(tags, ' ')) || to_tsvector('simple', title)));
Uh oh! This fails because array_to_string is not IMMUTABLE. There are SO answers that suggest wrapping array_to_string in an immutable function:
CREATE FUNCTION my_array_to_string(arr ANYARRAY, sep TEXT)
RETURNS text LANGUAGE SQL IMMUTABLE
AS $$
SELECT array_to_string(arr, sep);
$$;
CREATE INDEX book_fulltext_idx
ON book using GIN
((to_tsvector('simple', my_array_to_string(tags, ' ')) || to_tsvector('simple', title)));
The index create works! But it never gets used. EXPLAIN ANALYZE on the above SELECT always results in a sequential scan. Postgres is apparently too smart for this trickery.
Aggregate (cost=4348818.79..4348818.80 rows=1 width=8) (actual time=107489.124..107489.125 rows=1 loops=1)
-> Seq Scan on book (cost=0.00..4348543.45 rows=110135 width=0) (actual time=50.689..107477.408 rows=24641 loops=1)
Filter: ((to_tsvector('simple'::regconfig, my_array_to_string(tags, ' '::text)) || to_tsvector('simple'::regconfig, title)) @@ '''mysearchterm'''::tsquery)"
Rows Removed by Filter: 5354819
Planning Time: 0.144 ms
Execution Time: 107489.157 ms
I'm stumped. Is there some way of improving this?
New strategy: Use array_to_tsvector.
CREATE INDEX book_fulltext_idx
ON book using GIN
((array_to_tsvector(tags) || to_tsvector('simple', title)));
SELECT *
FROM book
WHERE array_to_tsvector(tags) || to_tsvector('simple', title)
@@ to_tsquery('simple', 'mysearchterm');
This works! The index is used! It's fast!
Bitmap Heap Scan on book (cost=2005.04..76150.11 rows=26973 width=147) (actual time=5.281..425.128 rows=946 loops=1)
Recheck Cond: ((array_to_tsvector(tags) || to_tsvector('simple'::regconfig, title)) @@ '''apple'''::tsquery)"
Heap Blocks: exact=790
-> Bitmap Index Scan on book_fulltext_idx (cost=0.00..1998.30 rows=26973 width=0) (actual time=4.468..4.468 rows=957 loops=1)
Index Cond: ((array_to_tsvector(tags) || to_tsvector('simple'::regconfig, title)) @@ '''mysearchterm'''::tsquery)"
Planning Time: 0.113 ms
Execution Time: 425.371 ms
BUT the semantics of the search are problematic. array_to_tsvector interprets the array as raw lexemes. to_tsquery strips out $ and dashes. That means tags with dollar signs or dashes are unsearchable.
-- This can NEVER match the tag `$mysearchterm`
SELECT *
FROM book
WHERE array_to_tsvector(tags) || to_tsvector('simple', title)
@@ to_tsquery('simple', '$mysearchterm');
Is there some way to make this do what I want? Seems like I want something like array_to_tsvector('simple', tags) but that function does not exist.
New strategy: Two indexes and OR
CREATE INDEX book_tags_fulltext_idx
ON book using GIN (array_to_tsvector(tags));
CREATE INDEX book_title_fulltext_idx
ON book using GIN (to_tsvector('simple', title));
SELECT *
FROM book
WHERE array_to_tsvector(tags) @@ '$mysearchterm' OR to_tsvector('simple', title)
@@ to_tsquery('simple', '$mysearchterm');
This produces the right answer in a reasonable amount of time, but breaks the semantics of search. You can't search for titleword tagword. The WHERE clause requires both words in the tags, or both words in the title. No bueno.
Epilogue
It looks like I either need to figure out how to index the tags array concatenated with the title, or somehow munge the values that are passed to array_to_tsvector. I'm not quite sure how to do either of those things. Any ideas?
We're using PG11, but I can upgrade if it will make a difference.