
We’re experiencing frequent long-running queries (>43 secs) in our PostgreSQL production DB, and they often get stuck on:

wait_event_type = LWLock
wait_event = BufferMapping

This seems to indicate contention on shared buffers. The queries are usually simple SELECTs (e.g., on the level_asr_asrlog table), but during peak usage they slow down drastically and sometimes get auto-killed after 60 seconds (our statement_timeout).
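
A snapshot of pg_stat_activity along these lines shows which backends are stuck on that wait and for how long (a minimal sketch; these are standard PostgreSQL 14 columns):

    -- backends currently waiting on LWLock:BufferMapping, longest-running first
    SELECT pid,
           now() - query_start AS runtime,
           state,
           wait_event_type,
           wait_event,
           left(query, 80) AS query
    FROM pg_stat_activity
    WHERE wait_event_type = 'LWLock'
      AND wait_event = 'BufferMapping'
    ORDER BY runtime DESC;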

Instance Configuration:
PostgreSQL version: 14
RAM: 104 GB (≈95 GB usable for Postgres)
vCPUs: 16
SSD Storage: GCP auto-scaled from 10TB → 15TB over the last year
Shared Buffers: 34.8 GB
work_mem: 4 MB
maintenance_work_mem: 64 MB
autovacuum_work_mem: -1 [I think this means it falls back to maintenance_work_mem]
temp_buffers: 8 MB
effective_cache_size: ~40 GB
max_connections: 800

Observations

  • VACUUM processes often take >10 minutes.
  • Memory is almost fully utilized (free memory <5%).
  • CPU spikes >95% and correlates with memory pressure.
  • The system appears to be thrashing, swapping data instead of doing useful work.
  • The wait event BufferMapping implies the backend is stuck trying to associate a block with a buffer, likely due to memory contention (a quick buffer-hit check follows below).
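
To put a number behind the memory observations, a query along these lines (a minimal sketch; level_asr_asrlog is the table mentioned above) shows how much of that table's traffic is served from shared buffers versus read in from outside:

    -- shared-buffer hit ratio for the table in question; note that "read"
    -- blocks may still come from the OS page cache rather than disk
    SELECT relname,
           heap_blks_read,
           heap_blks_hit,
           round(100.0 * heap_blks_hit
                 / nullif(heap_blks_hit + heap_blks_read, 0), 2) AS hit_pct
    FROM pg_statio_user_tables
    WHERE relname = 'level_asr_asrlog';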

I need help with the following:

  • How to further diagnose LWLock:BufferMapping contention? (A sketch of what I sample today follows this list.)
  • Is increasing work_mem or shared_buffers a safe direction?
  • Should I implement PgBouncer to reduce the impact of max_connections on memory?
  • How to confirm whether the OS is thrashing, and if so, how to resolve it?
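
The closest I get to a wait-event profile today is repeated sampling of pg_stat_activity during a slowdown (a minimal sketch; I don't know how to go deeper than this):

    -- run every few seconds during a spike and compare the counts
    SELECT wait_event_type, wait_event, count(*) AS backends
    FROM pg_stat_activity
    WHERE state = 'active'
    GROUP BY wait_event_type, wait_event
    ORDER BY backends DESC;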

3 Answers


First measure: see if the disks are overloaded. If yes, get more I/O power and tune your statements.
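
If pg_stat_statements is installed (an assumption; it is an extension, not on by default), the statements that drive most of the buffer reads can be listed like this (a sketch; column names as of PostgreSQL 13+):

    -- top statements by blocks read into shared buffers
    SELECT queryid,
           calls,
           shared_blks_hit,
           shared_blks_read,
           round(total_exec_time::numeric, 1) AS total_ms,
           left(query, 60) AS query
    FROM pg_stat_statements
    ORDER BY shared_blks_read DESC
    LIMIT 10;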

Second measure: reduce shared_buffers to 16GB or 8GB.
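
If you try that, the change itself is a one-liner but takes effect only after a restart (16GB here is simply the value suggested above, not a measured optimum):

    ALTER SYSTEM SET shared_buffers = '16GB';
    -- shared_buffers cannot be changed with a reload; restart the instance afterwards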

Third measure: reduce the number of connections with a reasonably sized connection pool (you don't have to reduce max_connections).
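
Before sizing the pool, it is worth checking how many of the 800 allowed connections are actually doing work at any moment (a minimal sketch):

    -- client connections by state; lots of 'idle' rows argue for a small pool
    SELECT state, count(*) AS connections
    FROM pg_stat_activity
    WHERE backend_type = 'client backend'
    GROUP BY state
    ORDER BY connections DESC;

If most of them are idle, a pool sized at a small multiple of your 16 vCPUs is usually plenty.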


maintenance_work_mem is very low for a database server with 104 GB of RAM and 15 TB of storage. I would change this one and give it at least a few GB of RAM.
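
A sketch of the change, assuming you settle on 2GB (a reload is enough; new sessions and newly started autovacuum workers pick it up):

    ALTER SYSTEM SET maintenance_work_mem = '2GB';
    SELECT pg_reload_conf();
    -- note: in PostgreSQL 14, VACUUM's dead-tuple array is still capped at 1GB,
    -- but index builds and other maintenance operations can use the full amount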

But also check the rest of your configuration and statistics. I expect your current observations are just the tip of the iceberg regarding performance issues.


This issue might arise from I/O-heavy operations such as sequential scans or VACUUM, which pull new pages into shared buffers and can make queries that would otherwise run smoothly appear slow. Use a monitoring tool or the GCP console to check buffer usage and I/O during the spikes. Also, review the slow-query log to identify and tune the problematic queries.
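
One way to check buffer usage from inside the database is the pg_buffercache extension (a minimal sketch, adapted from the example in the PostgreSQL documentation):

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    -- which relations currently occupy the most shared buffers
    SELECT c.relname,
           count(*) AS buffers,
           pg_size_pretty(count(*) * 8192) AS size
    FROM pg_buffercache b
    JOIN pg_class c
      ON b.relfilenode = pg_relation_filenode(c.oid)
     AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                               WHERE datname = current_database()))
    GROUP BY c.relname
    ORDER BY buffers DESC
    LIMIT 10;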

Since you have about 95 GB for Postgres, set effective_cache_size accordingly to improve query planning. Increasing maintenance_work_mem or autovacuum_work_mem to 1–2 GB can boost vacuum performance.
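
The corresponding changes could look like this (70GB and 1GB are illustrative values, not tuned for your workload; effective_cache_size should roughly reflect shared_buffers plus whatever the OS can cache, and a reload is sufficient for both settings):

    ALTER SYSTEM SET effective_cache_size = '70GB';
    ALTER SYSTEM SET autovacuum_work_mem = '1GB';
    SELECT pg_reload_conf();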

Implementing PgBouncer with transaction pooling is a solid move; just ensure your app doesn't rely on session-level features such as prepared statements, session-level advisory locks, or temporary tables.
