
I'm working on incorporating full-text search into my app. In the production version, a user will enter a search phrase which will be searched against 10M+ rows in a table. I'm currently testing it out with a subset of that data (~800k rows) and having some speed issues. When I run this query:

SELECT title, ts_rank_cd(title_abstract_tsvector, to_tsquery('english','cancer'), 4) AS rank
FROM test_search_articles 
WHERE title_abstract_tsvector @@ to_tsquery('cancer') 
ORDER BY rank LIMIT 50

where 'cancer' is the search term, the query takes 25-30 seconds. However, when I change the ORDER BY from rank to id like below:

SELECT title, ts_rank_cd(title_abstract_tsvector, to_tsquery('english','cancer'), 4) AS rank 
FROM test_search_articles 
WHERE title_abstract_tsvector @@ to_tsquery('cancer') 
ORDER BY id LIMIT 50

the query takes <1sec. I'm confused why changing the ORDER BY accounts for such a huge change in query speed, especially given that rank is returned in both. Could anyone help me understand this and what to do to make the original query faster? Not sure if it's relevant, but I'm currently using a GIN index on my tsvector column (title_abstract_tsvector).

EDIT: Running either query without the LIMIT takes 25-30 seconds, which answers my question about why ORDER BY id matters: with ORDER BY id the planner can walk the primary-key index and stop after the first 50 matches, whereas ORDER BY rank has to compute the rank for every matching row before it can sort and apply the LIMIT. As for how to speed up the first query, I'm still looking for a solution.

EDIT 2: Create Index statements

CREATE UNIQUE INDEX test_search_articles_pkey ON public.test_search_articles USING btree (id)

CREATE INDEX article_idx ON public.test_search_articles USING gin (title_abstract_tsvector)

Execution Plan

"Gather  (cost=1679.97..177072.34 rows=71706 width=103) (actual time=43.963..28084.129 rows=72111 loops=1)"
"  Workers Planned: 2"
"  Workers Launched: 2"
"  Buffers: shared hit=194957 read=97049"
"  I/O Timings: read=80499.893"
"  ->  Parallel Bitmap Heap Scan on test_search_articles  (cost=679.97..168901.74 rows=29878 width=103) (actual time=15.580..28008.573 rows=24037 loops=3)"
"        Recheck Cond: (title_abstract_tsvector @@ to_tsquery('cancer'::text))"
"        Heap Blocks: exact=16483"
"        Buffers: shared hit=194957 read=97049"
"        I/O Timings: read=80499.893"
"        ->  Bitmap Index Scan on article_idx  (cost=0.00..662.04 rows=71706 width=0) (actual time=27.719..27.720 rows=72111 loops=1)"
"              Index Cond: (title_abstract_tsvector @@ to_tsquery('cancer'::text))"
"              Buffers: shared hit=1 read=20"
"              I/O Timings: read=11.768"
"Planning Time: 12.145 ms"
"Execution Time: 28104.318 ms"

EDIT 3:

select pg_relation_size('test_search_articles'): 2176933888

select pg_table_size('test_search_articles'): 4283850752

avg(pg_column_size(title_abstract_tsvector)) over the entire table: 1343.5673777677141794

avg(pg_column_size(title_abstract_tsvector)) over the rows matching the 'cancer' query: 1576.1418923603888450

EDIT 4: VACUUM FULL ANALYZE output message:

INFO: vacuuming "public.test_search_articles"

INFO: "test_search_articles": found 0 removable, 1003125 nonremovable row versions in 265739 pages

DETAIL: 0 dead row versions cannot be removed yet.

CPU: user: 31.15 s, system: 10.38 s, elapsed: 126.14 s.

INFO: analyzing "public.test_search_articles"

INFO: "test_search_articles": scanned 30000 of 161999 pages, containing 185588 live rows and 0 dead rows; 30000 rows in sample, 1002169 estimated total rows

VACUUM

  • Please edit your question and add the execution plan of both queries generated using explain (analyze, buffers, format text) (not just a "simple" explain) as formatted text, and make sure you preserve the indentation of the plan. Paste the text, then put ``` on the line before the plan and on a line after the plan. Please also include the complete create index statements for all indexes. Commented Feb 15, 2021 at 22:20
  • @a_horse_with_no_name Added the execution plan and the create index statements. Let me know if you need anything else. Commented Feb 15, 2021 at 22:54
  • Since this is a test system, can you do a VACUUM FULL of the table, followed by a VACUUM ANALYZE, and see if that changes anything? I wonder if you have a lot of bloat and/or fragmentation. Commented Feb 18, 2021 at 17:09
  • @jjanes I misread your ask and ran a VACUUM FULL ANALYZE. I'm putting the output as Edit 4 above. Commented Feb 18, 2021 at 18:54

2 Answers


Quite often a slow search query is also a useless one, which means optimizing it is a waste of time.

If there are 24037 records matching "cancer", then searching on this single term will never return relevant results to the user, so it is pointless to sort by relevance. What would be more useful is a set of heuristics, for example: if the user enters only one search term, display the most recent articles about this term (fast) and maybe offer a list of keywords often related to this term. Then switch to "ORDER BY rank" only when the user enters enough search keywords to produce a meaningful rank. This way you can implement a search that is not just faster, but also more useful.
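As an illustration only, a minimal sketch of that single-term fallback. It assumes a hypothetical publication_date column with its own btree index, neither of which appears in the question; with such an index the planner can walk it newest-first, and for a common term it only has to inspect a few hundred rows before the LIMIT is satisfied instead of detoasting and ranking all ~72k matches:

-- Hypothetical index so "most recent first" can stop early (an assumption, not part of the original schema)
CREATE INDEX test_search_articles_pubdate_idx ON public.test_search_articles USING btree (publication_date DESC);

-- Single-term fallback: match with the existing GIN index, order by recency instead of rank
SELECT title
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('english', 'cancer')
ORDER BY publication_date DESC
LIMIT 50;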

Maybe you will say, "but if I type a single word into google I get relevant results!" and... yes, of course, but that's because google knows everything you do, it always has context, and if you don't enter extra search terms, it will do it for you.




Essentially all your time is spent on IO. So the main thing you can do is get faster IO, or more RAM so you can cache more of the data.

The fact that your buffers read count is 6 times greater than your exact heap blocks suggests that your title_abstract_tsvector is so large that it has been TOASTed and now needs to be reassembled from multiple pages in order to be used in the computation of the rank function. Is that plausible? How large is that column on average?
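For reference, the kind of checks I mean look roughly like this (plain built-in functions, nothing specific to your schema):

-- Heap size excluding TOAST vs. total size including TOAST
select pg_relation_size('test_search_articles');
select pg_table_size('test_search_articles');

-- Average stored size of the tsvector column, for the whole table and for the rows matching the slow query
select avg(pg_column_size(title_abstract_tsvector)) from test_search_articles;
select avg(pg_column_size(title_abstract_tsvector)) from test_search_articles where title_abstract_tsvector @@ to_tsquery('cancer');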

Are you already saturating your disk capacity? If not, you could try to get a larger degree of parallelization by, for example, increasing max_parallel_workers_per_gather.
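As a rough sketch, that experiment could look like this in a session (the effective worker count is still capped by max_parallel_workers and max_worker_processes):

-- Try a higher per-query worker limit, then re-check the plan
SET max_parallel_workers_per_gather = 4;

EXPLAIN (ANALYZE, BUFFERS)
SELECT title, ts_rank_cd(title_abstract_tsvector, to_tsquery('english','cancer'), 4) AS rank
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('cancer')
ORDER BY rank LIMIT 50;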

But the main thing you can do is just not run that query very much. Either don't let users run such a non-specific query, or precompute and store the results of single-term queries so they can be returned without recomputing them.
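A rough sketch of the precompute idea, using a hypothetical cache table whose name, columns, and refresh strategy are all assumptions rather than anything from your setup:

-- Hypothetical cache of the top 50 results per single search term
CREATE TABLE single_term_cache (
    term       text   NOT NULL,
    article_id bigint NOT NULL,
    rank       real   NOT NULL,
    PRIMARY KEY (term, article_id)
);

-- Populate (or refresh) the cache for one term, paying the expensive ranking once
INSERT INTO single_term_cache (term, article_id, rank)
SELECT 'cancer', id, ts_rank_cd(title_abstract_tsvector, to_tsquery('english','cancer'), 4)
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('english','cancer')
ORDER BY 3 DESC
LIMIT 50;

-- Serving the search then becomes a cheap lookup plus join
SELECT a.title, c.rank
FROM single_term_cache c
JOIN test_search_articles a ON a.id = c.article_id
WHERE c.term = 'cancer'
ORDER BY c.rank DESC
LIMIT 50;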

3 Comments

Thanks for the detailed response! I don't think that the title_abstract_tsvector is prohibitively large, but I'm not sure. On average, that column contains somewhere between 100-300 terms. That tsvector column is made from the Title and Abstract columns, so a total of <1k words usually. As for the frequency of use, I'm trying to create search functionality where a user can enter a query and get relevant results based on article titles and abstracts. That seems pretty reasonable, so I feel there has to be a way to search faster.
The numbers aren't really adding up for me. Can you check the sizes with pg_relation_size and pg_table_size (size not including TOAST, and including TOAST)? And maybe get some stats using avg(pg_column_size(title_abstract_tsvector)) for the whole table, and for the part that matches 'cancer'?
I added the information you requested above under Edit 3. I also updated the execution plan since I've added more rows since posting the question. Appreciate the help!
