Context:
- Postgres 15
- the table has hundreds of millions, possibly billions, of rows; it contains an id column and a JSONB column
- no other constraints (FKs)
- selects from the table usually search on the content of JSON fields
- one JSONB cell is around 5 kB in our case
- not using Postgres is out of the question
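For concreteness, a minimal sketch of the setup described above. The index name `gin_idx`, the `bigint` primary key and the JSON content are assumptions made for illustration; only the table and column names match the ones used later in this post.

```sql
-- Assumed shape of the table: an id column plus one JSONB column.
-- The bigint PRIMARY KEY is an assumption; some unique index on id is
-- needed anyway for INSERT ... ON CONFLICT (id) to work.
CREATE TABLE thetablename (
    id            bigint PRIMARY KEY,
    thejsoncolumn jsonb  NOT NULL
);

-- GIN index with the default operator class for jsonb (jsonb_ops),
-- which indexes both keys and values.
CREATE INDEX gin_idx ON thetablename USING gin (thejsoncolumn);

-- Typical lookup: a containment search that the GIN index can serve.
-- The document content here is made up.
SELECT id
FROM thetablename
WHERE thejsoncolumn @> '{"customer": {"country": "DE"}}';
```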
The problem: Sometimes the table is under heavy upsert pressure (INSERT ON CONFLICT). In these cases we need high throughput, so we'd like to run upserts concurrently. The issue is that, in order to facilitate searching in the table, we use a GIN index on the JSONB column. Maintaining this index seems to limit how far we can scale upsert performance.
- By default, Postgres uses `fastupdate` to allow fast upserts, deferring GIN maintenance until a batch of non-indexed tuples accumulates. The default limit for this is 4 MB. We see p50 and p95 upsert durations of <15 msec, but under high load every 20-40 seconds we see a 10-second pause during which nothing can insert into the table. We think this is due to GIN index maintenance, i.e. processing the pending non-indexed tuples. Throughput averages out to ~80-100 upserts/second, independent of thread count.
- If we set `fastupdate=off`, things are better; upserts get somewhat slower, but under high load there's still a limit on concurrency, so we can't scale upserts beyond a concurrency of 2-3, which limits throughput to ~100 upserts/second or so.
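The two variants above can be sketched roughly as follows. The index name `gin_idx`, the literal values and the conflict target `id` are assumptions; `pgstatginindex` comes from the `pgstattuple` extension, which has to be installable on the server for the last part to work.

```sql
-- Assumed shape of the upsert (conflict target is the id column).
INSERT INTO thetablename (id, thejsoncolumn)
VALUES (42, '{"status": "new"}')
ON CONFLICT (id) DO UPDATE
SET thejsoncolumn = EXCLUDED.thejsoncolumn;

-- Variant 1 (default): new index entries go to the pending list first.
ALTER INDEX gin_idx SET (fastupdate = on);

-- Variant 2: every upsert updates the main GIN structure directly.
-- Switching this off does not flush entries that are already pending,
-- so a VACUUM (or gin_clean_pending_list) is needed afterwards.
ALTER INDEX gin_idx SET (fastupdate = off);
VACUUM thetablename;

-- Check how large the pending list currently is. Sampling this during
-- the load test could help confirm whether the pauses line up with
-- the pending list being flushed.
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT pending_pages, pending_tuples
FROM pgstatginindex('gin_idx');
```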
It is OK for us if reads from this table become a bit slower, so I was thinking that we could use a GIST index instead, but I don't know how to do it properly.
> create index concurrently gist_idx on thetablename using gist(thejsoncolumn);
ERROR: data type jsonb has no default operator class for access method "gist"
HINT: You must specify an operator class for the index or define a default operator class for the data type.
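To see which operator classes are actually available for `jsonb`, the system catalogs can be queried. This is only a diagnostic sketch; as far as I can tell, a stock installation lists btree, hash and GIN entries for `jsonb` but nothing for GiST, which would explain the error above.

```sql
-- List every operator class that accepts jsonb as its input type,
-- grouped by index access method (btree, gin, gist, ...).
SELECT am.amname      AS access_method,
       opc.opcname    AS operator_class,
       opc.opcdefault AS is_default
FROM pg_opclass AS opc
JOIN pg_am      AS am ON am.oid = opc.opcmethod
WHERE opc.opcintype = 'jsonb'::regtype
ORDER BY am.amname, opc.opcname;
```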
AFAIK GIST index maintenance is also more concurrency-friendly as upserts into a table with GIN need to lock multiple rows in the index.
Questions:
- does my GIST idea look sensible?
- how should I create a GIST index on the JSONB column (both keys and values are searched for, so something similar to `jsonb_ops` would be OK)?
- does someone have experience with how `fastupdate=off` behaves under concurrent load?
- maybe we should set `gin_pending_list_limit` to something very high (gigabytes) and let autovacuum process it in the background, accepting lower SELECT performance until then?
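The last idea could look roughly like this, as an untested sketch. The index name `gin_idx` and the 1 GB value are assumptions. One caveat I'm aware of: searches have to scan the pending list in addition to the main index, and if the list ever exceeds the limit, the inserting session itself does the cleanup in the foreground.

```sql
-- Raise the pending list limit for this index only.
-- The value is in kilobytes: 1048576 kB = 1 GB.
ALTER INDEX gin_idx SET (gin_pending_list_limit = 1048576);

-- Flush the pending list explicitly during a quiet period instead of
-- waiting for autovacuum. Returns the number of pages removed from
-- the pending list.
SELECT gin_clean_pending_list('gin_idx'::regclass);
```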
Thanks.