I have inserted a lot of data (more than 2 million documents) into a table and created a full-text search index using GIN, and it works great: I can query the database and retrieve the appropriate documents quickly.

Regularly, I collect new data that I insert into the database. What I would like to do is update my index with the new data only, but I have failed so far. I don't want to drop and recreate the index, because recreating it takes ages. Basically, I would like to do an incremental update of the index. I can update it on the fly while the data is being inserted, but that is very slow. I read that creating an index after the data has been inserted is faster (true), so I assumed that updating the index for just the new data should also be possible. But I haven't managed to do it so far.

I use PostgreSQL 12.

Can anybody help me, please?

1 Answer

There is no way to suspend adding values to the index while you load data.

But GIN indexes already have a feature to optimize exactly that: the GIN fast update technique. Set the gin_pending_list_limit storage parameter on the index to a high value, so that new entries accumulate in the pending list during the load instead of being merged into the main index structure right away. Once you are done with the bulk load, VACUUM the table to integrate the pending list into the main index.
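A minimal sketch of that workflow, assuming a table post with a GIN index named post_search_idx (the index name is a placeholder, not from the original post):

    -- raise the pending-list limit for this index (the value is in kB);
    -- this only has an effect while fastupdate is on, which is the default
    ALTER INDEX post_search_idx SET (gin_pending_list_limit = 65536);

    -- ... run the bulk load here ...

    -- merge the accumulated pending list into the main index structure
    VACUUM post;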

An alternative approach is to use partitioning and load a whole partition at once: create the new partition as a standalone table, load it, create the index on it, and then attach it to the partitioned table.
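A sketch of that approach, under the assumption that post is range-partitioned by a date column created_at (the partitioning scheme and all names here are hypothetical):

    -- build the new partition as a standalone table and load it
    CREATE TABLE post_2024_07 (LIKE post INCLUDING DEFAULTS);
    -- ... bulk load into post_2024_07 here ...

    -- create the GIN index while the table is still standalone (no pending list needed)
    CREATE INDEX ON post_2024_07 USING gin (search_vector);

    -- attach it; the index is attached to a matching partitioned index on the parent
    ALTER TABLE post ATTACH PARTITION post_2024_07
        FOR VALUES FROM ('2024-07-01') TO ('2024-08-01');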

Comments

Thanks a lot, I am going to try that. Just to make sure: I populate my database using a Python script and SQLAlchemy. Then I alter my table to add a search_vector column of type tsvector and create an index on it. By running UPDATE post SET search_vector = (to_tsvector(title) || to_tsvector(content)); I populate the column that the index is built on, which I can then use to query my database. Question: when you say values get added to the index when you load data, does it mean that the next time I run my Python script to load new data, the index will automatically get updated?
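For reference, the workflow described in this comment looks roughly like this (the table and column names are taken from the comment; the index name is a placeholder):

    -- add the column, populate it, and index it
    ALTER TABLE post ADD COLUMN search_vector tsvector;
    UPDATE post SET search_vector = to_tsvector(title) || to_tsvector(content);
    CREATE INDEX post_search_idx ON post USING gin (search_vector);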
Yes, no matter how you modify the table, PostgreSQL will always modify the index along with it, keeping everything consistent. But for performance reasons, a GIN index keeps these modifications in a "pending list", which is a kind of extra overflow area. It is like the stack of recently acquired books in a library that have not yet been put in their proper place: when somebody comes looking for a book, you look in the proper catalog, but you also look through the stack of new arrivals.
But this pending list is eventually included in the index, isn't it? Searches must scan the list of pending entries in addition to the regular index only if a query is submitted during the process; a query submitted once all the data have been imported will only look at the regular index. Is that correct?
As soon as the pending list exceeds gin_pending_list_limit (or a VACUUM is run), the pending list is cleared and the actual index is modified (the library books are put on the shelves). Any query will always consult the pending list (it is part of the index), but if it is empty, there is no overhead.
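If you want to verify that the pending list really is empty after a VACUUM, the pgstattuple extension can report its size; a small sketch, with the index name again being a placeholder:

    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    -- pending_pages and pending_tuples should both be 0 once the list has been merged
    SELECT pending_pages, pending_tuples FROM pgstatginindex('post_search_idx');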