I'm working on incorporating full-text search into my app. In the production version, a user will enter a search phrase which will be searched against 10M+ rows in a table. I'm currently testing it out with a subset of that data (~800k rows) and having some speed issues. When I run this query:
SELECT title, ts_rank_cd(title_abstract_tsvector, to_tsquery('english','cancer'), 4) AS rank
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('cancer')
ORDER BY rank LIMIT 50
where 'cancer' is the search term, it takes 25-30 seconds. However, when I change the ORDER BY from rank to id, like below:
SELECT title, ts_rank_cd(title_abstract_tsvector, to_tsquery('english','cancer'), 4) AS rank
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('cancer')
ORDER BY id LIMIT 50
the query takes less than a second. I'm confused about why changing the ORDER BY column makes such a huge difference in query speed, especially given that rank is returned in both cases. Could anyone help me understand this, and what I can do to make the original query faster? Not sure if it's relevant, but I'm currently using a GIN index on the tsvector column (title_abstract_tsvector).
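For context, title_abstract_tsvector is a precomputed tsvector built from each article's title and abstract, roughly like this (simplified sketch; the real table has more columns, and the column is populated separately rather than necessarily by the UPDATE shown here):

-- Simplified, assumed schema sketch; the tsvector could instead be
-- maintained by a trigger or filled in at load time.
CREATE TABLE test_search_articles (
    id                      bigint PRIMARY KEY,
    title                   text,
    abstract                text,
    title_abstract_tsvector tsvector
);

-- Conceptually, the column holds the 'english' tsvector of title + abstract:
UPDATE test_search_articles
SET title_abstract_tsvector =
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(abstract, ''));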
EDIT: Running either query without the LIMIT takes 25-30 seconds, which answers my question about why ORDER BY id matters. As for how to speed up the first query, I'm still looking for a solution.
EDIT 2: Create Index statements
CREATE UNIQUE INDEX test_search_articles_pkey ON public.test_search_articles USING btree (id)
CREATE INDEX article_idx ON public.test_search_articles USING gin (title_abstract_tsvector)
Execution Plan
"Gather (cost=1679.97..177072.34 rows=71706 width=103) (actual time=43.963..28084.129 rows=72111 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" Buffers: shared hit=194957 read=97049"
" I/O Timings: read=80499.893"
" -> Parallel Bitmap Heap Scan on test_search_articles (cost=679.97..168901.74 rows=29878 width=103) (actual time=15.580..28008.573 rows=24037 loops=3)"
" Recheck Cond: (title_abstract_tsvector @@ to_tsquery('cancer'::text))"
" Heap Blocks: exact=16483"
" Buffers: shared hit=194957 read=97049"
" I/O Timings: read=80499.893"
" -> Bitmap Index Scan on article_idx (cost=0.00..662.04 rows=71706 width=0) (actual time=27.719..27.720 rows=72111 loops=1)"
" Index Cond: (title_abstract_tsvector @@ to_tsquery('cancer'::text))"
" Buffers: shared hit=1 read=20"
" I/O Timings: read=11.768"
"Planning Time: 12.145 ms"
"Execution Time: 28104.318 ms"
EDIT 3:
select pg_relation_size('test_search_articles'): 2176933888
select pg_table_size('test_search_articles'): 4283850752
average pg_column_size of title_abstract_tsvector across the entire table: 1343.5673777677141794
average pg_column_size of title_abstract_tsvector across rows matching the 'cancer' query: 1576.1418923603888450
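For reference, these numbers came from queries along these lines (the two column-size figures are averages of pg_column_size over the column):

SELECT pg_relation_size('test_search_articles');
SELECT pg_table_size('test_search_articles');

SELECT avg(pg_column_size(title_abstract_tsvector))
FROM test_search_articles;

SELECT avg(pg_column_size(title_abstract_tsvector))
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('cancer');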
EDIT 4: VACUUM output:
INFO: vacuuming "public.test_search_articles"
INFO: "test_search_articles": found 0 removable, 1003125 nonremovable row versions in 265739 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU: user: 31.15 s, system: 10.38 s, elapsed: 126.14 s.
INFO: analyzing "public.test_search_articles"
INFO: "test_search_articles": scanned 30000 of 161999 pages, containing 185588 live rows and 0 dead rows; 30000 rows in sample, 1002169 estimated total rows
VACUUM
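(For completeness, the output above is from a verbose vacuum with analyze, i.e. something like:)

VACUUM (VERBOSE, ANALYZE) test_search_articles;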