How to make big postgres db faster?

Question

I have big Postgres database(around 75 GB) and queries are very slow. Is there any way to make them faster?

About database:

                                             List of relations
 Schema |       Name        |   Type   |  Owner   | Persistence | Access method |    Size    | Description
--------+-------------------+----------+----------+-------------+---------------+------------+-------------
 public | fingerprints      | table    | postgres | permanent   | heap          | 35 GB      |
 public | songs             | table    | postgres | permanent   | heap          | 26 MB      |
 public | songs_song_id_seq | sequence | postgres | permanent   |               | 8192 bytes |

\d+ fingerprints
                                                    Table "public.fingerprints"
    Column     |            Type             | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
---------------+-----------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 hash          | bytea                       |           | not null |         | extended |             |              |
 song_id       | integer                     |           | not null |         | plain    |             |              |
 offset        | integer                     |           | not null |         | plain    |             |              |
 date_created  | timestamp without time zone |           | not null | now()   | plain    |             |              |
 date_modified | timestamp without time zone |           | not null | now()   | plain    |             |              |
Indexes:
    "ix_fingerprints_hash" hash (hash)
    "uq_fingerprints" UNIQUE CONSTRAINT, btree (song_id, "offset", hash)
Foreign-key constraints:
    "fk_fingerprints_song_id" FOREIGN KEY (song_id) REFERENCES songs(song_id) ON DELETE CASCADE
Access method: heap

\d+ songs
                                                                       Table "public.songs"
    Column     |            Type             | Collation | Nullable |                Default                 | Storage  | Compression | Stats target | Description
---------------+-----------------------------+-----------+----------+----------------------------------------+----------+-------------+--------------+-------------
 song_id       | integer                     |           | not null | nextval('songs_song_id_seq'::regclass) | plain    |             |              |
 song_name     | character varying(250)      |           | not null |                                        | extended |             |              |
 fingerprinted | smallint                    |           |          | 0                                      | plain    |             |              |
 file_sha1     | bytea                       |           |          |                                        | extended |             |              |
 total_hashes  | integer                     |           | not null | 0                                      | plain    |             |              |
 date_created  | timestamp without time zone |           | not null | now()                                  | plain    |             |              |
 date_modified | timestamp without time zone |           | not null | now()                                  | plain    |             |              |
Indexes:
    "pk_songs_song_id" PRIMARY KEY, btree (song_id)
Referenced by:
    TABLE "fingerprints" CONSTRAINT "fk_fingerprints_song_id" FOREIGN KEY (song_id) REFERENCES songs(song_id) ON DELETE CASCADE
Access method: heap

DB Scheme DB Amount

No need to write to database, only read. All queries are very simple:

SELECT song_id 
WHERE hash in fingerpints = X

 EXPLAIN(analyze, buffers, format text) SELECT "song_id", "offset" FROM "fingerprints" WHERE "hash" = decode('eeafdd7ce9130f9697','hex');
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ix_fingerprints_hash on fingerprints  (cost=0.00..288.28 rows=256 width=8) (actual time=0.553..234.257 rows=871 loops=1)
   Index Cond: (hash = '\xeeafdd7ce9130f9697'::bytea)
   Buffers: shared hit=118 read=749
 Planning Time: 0.225 ms
 Execution Time: 234.463 ms
(5 rows)

234 ms looks fine where it is one query. But in reality there 3000 query per time, that takes about 600 seconds. It is audio recognition application, so algoritm works like that.

About indexes:

CREATE INDEX  "ix_fingerprints_hash" ON "fingerprints" USING hash ("hash");

For pooler I use Odyssey.

Little bit of info from config:

shared_buffers = 4GB   
huge_pages = try
work_mem = 582kB
maintenance_work_mem = 2GB  
effective_io_concurrency = 200
max_worker_processes = 24
max_parallel_workers_per_gather = 12
max_parallel_maintenance_workers = 4
max_parallel_workers = 24
wal_buffers = 16MB
checkpoint_completion_target = 0.9
max_wal_size = 16GB
min_wal_size = 4GB
random_page_cost = 1.1
effective_cache_size = 12GB

Info about hardware:

Xeon 12 core (24 threads)
RAM DDR4 16 GB ECC
NVME disk

Will the database be accelerated by purchase more RAM to handle all DB inside (128 GB in example)? And what parameters should I change to say to Postgres to store db in ram?

I read about several topics about pg_tune, etc. but experiments don't show any good results.

This might be a better fit for Database Administrators. But even there, please post your schema as text, not a link to a painting of the code. — Bergi
– Bergi, Commented Dec 23, 2022 at 19:29
How often are rows updated? I ask this because, if the rows never update, you could start caching the results in a Redis or MemCached, that way that select would take less than 1ms. — GabrielNexT
– GabrielNexT, Commented Dec 23, 2022 at 20:42
do you store files in your database (column hash) ? what file size ? this could slow down the database — Edouard
– Edouard, Commented Dec 23, 2022 at 20:57
NVME disks are reputed to be very fast, but they don't seem all that fast here. Do you have specs on what the drives should deliver (in terms of IOPS, not sequential read throughput) and can you test that they deliver that with some benchmarking tool outside of PostgreSQL? — jjanes
– jjanes, Commented Dec 23, 2022 at 22:40

jjanes · Accepted Answer · 2022-12-24 00:04:07Z

Increasing the RAM so that everything can stay in cache (perhaps after using pg_prewarm to get it into cache in the first place) would certainly work. But it is expensive and shouldn't be necessary.

Having a hash index on something which is already a hashed value is probably not very helpful. Have you tried just a default (btree) index instead?

If you CLUSTER the table on the index over the column named "hash" (which you can only do if it is a btree index) then rows with the same hash code should mostly share the same table page, which would greatly cut down on the number of different buffer reads needed to fetch them all.

If you could get it do a bitmap heap scan instead of an index scan, then it should be able to have a large number of read requests outstanding at a time, due to effective_io_concurrency. But the planner does not account for effective_io_concurrency when doing planning, which means it won't choose a bitmap heap scan specifically to get it that benefit. Normally an index read finding hundreds of rows on different pages would automatically choose a bitmap heap scan method, but in your case it is probably the low setting of random_page_cost which is inhibiting it from doing so. The low setting of random_page_cost is probably reasonable in itself, but it does have this unfortunate side effect. A problem with this strategy is that it doesn't reduce the overall amount of IO needed, it just allows them overlap and so make better use of multiple IO channels. But if many sessions are running many instances of this query at the same time, they will start filling up those channels and so start competing with each other. So the CLUSTER method is probably superior as it gets the same answer with less IO. If you want to play around with bitmap scans, you could temporarily increase random_page_cost or temporarily set enable_indexscan to off.

Erwin Brandstetter · Accepted Answer · 2022-12-26 03:18:10Z

1

No need to write to database, only read.

So the DB is read-only.
And in comments:

db worked fine on small amount of data(few GB), but after i filled out database started to slowdown.

So indexes have been built up incrementally.

Indexes

`UNIQUE CONSTRAINT` on `(song_id, "offset", hash)`

I would replace that with:

ALTER TABLE fingerprints
  DROP CONSTRAINT uq_fingerprints
, ADD CONSTRAINT uq_fingerprints UNIQUE(hash, song_id, "offset") WITH (FILLFACTOR = 100)

This enforces the same constraint, but the leading hash column in the underlying B-tree index now supports the filter on hash in your displayed query. And the fact that all needed columns are included in the index, further allows much faster index-only scans. The (smaller) index should also be more easily cached than the (bigger) table (plus index).

See:

Is a composite index also good for queries on the first field?

Also rewrites the index in pristine condition, and with FILLFACTOR 100 for the read-only DB. (Instead of the default 90 for a B-tree index.)

Hash index on `(hash)` and `CLUSTER`

The name of the column "hash" has nothing to do with the name of the index type, which also happens to be "hash". (The column should probably not be named "hash" to begin with.)

If (and only if) you also have other queries centered around one of few hash values, that cannot use index-only scans (and you actually see faster queries than without) keep the hash index, additionally. And optimize it. (Else drop it!)

ALTER INDEX ix_fingerprints_hash SET (FILLFACTOR = 100);

An incrementally grown index may end up with bloat or unbalanced overflow pages in case of a hash index. REINDEX should take care of that. While being at it, increase FILLFACTER to 100 (from the default 75 for a hash index) for your read-only (!) DB. You can REINDEX to make the change effective.

REINDEX INDEX ix_fingerprints_hash;

Or you can CLUSTER (like jjanes already suggested) on the rearranged B-tree index from above:

CLUSTER fingerprints USING uq_fingerprints;

Rewrites the table and all indexes; rows are physically sorted according to the given index, so "clustered" around the leading column(s). Effects are permanent for your read-only DB. But index-only scans do not benefit from this.

When done optimizing, run once:

VACUUM ANALYZE fingerprints;

`work_mem`

The tiny setting for work_mem stands out:

work_mem = 582kB

Even the (very conservative!) default is 4MB. But after reading your question again, it would seem you only have tiny queries. So maybe that's ok after all.

Else, with 16GB RAM you can typically afford a 100 times as much. Depends on your work load of course.

Many small queries, many parallel workers --> keep small work_mem (like 4MB?)
Few big queries, few parallel workers --> go high (like 256MB? or more)

Large amounts of temporary files written in your database over time, and mentions of "disk" in the output of EXPLAIN ANALYZE would indicate the need for more work_mem.

Additional questions

Will the database be accelerated by purchase more RAM to handle all DB inside (128 GB in example)?

More RAM almost always helps until the whole DB can be cached in RAM and all processes can afford all the work_mem they desire.

And what parameters should I change to say to Postgres to store db in ram?

Everything that's read from the database is cached automatically in system cache and Postgres cache, up to the limit of available RAM. (Setting work_mem too high competes for that same resource.)

edited Dec 26, 2022 at 3:18

answered Dec 23, 2022 at 22:32

Erwin Brandstetter

668k159 gold badges1.2k silver badges1.3k bronze badges

6 Comments

unbrokendub Over a year ago

Wow. After CLUSTER database works like a charm, looks like 10 times faster.

Erwin Brandstetter Over a year ago

@unbrokendub I assume you rearranged the unique coctraint first? Do you see index-only scans for the sample qiery now?

unbrokendub Over a year ago

Yes, i first rearranged constraints, and after that make cluster. Where i can view that there index only scans?

unbrokendub Over a year ago

This is the Explain output: ` Index Only Scan using uq_fingerprints on fingerprints (cost=0.57..61.71 rows=1665 width=8) (actual time=0.312..0.323 rows=33 loops=1) Index Cond: (hash = '\x5cfb91912e6cc544ab'::bytea) Heap Fetches: 0 Buffers: shared hit=4 read=1 Planning Time: 0.121 ms Execution Time: 0.350 ms`. And want to say thanks for great explanation what to do with database to accelarate it. After reading some documentation, now i understand why CLUSTER make db so fast.

Erwin Brandstetter Over a year ago

For the given query, the index-only scan after rearranging the unique constraint makes all the difference. CLUSTER is unrelated. I clarified my explanation in the answer.

|

Collectives™ on Stack Overflow

How to make big postgres db faster?

2 Answers 2

Comments

Indexes

`UNIQUE CONSTRAINT` on `(song_id, "offset", hash)`

Hash index on `(hash)` and `CLUSTER`

`work_mem`

Additional questions

6 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Indexes

UNIQUE CONSTRAINT on (song_id, "offset", hash)

Hash index on (hash) and CLUSTER

work_mem

Additional questions

6 Comments

Your Answer

Sign up or log in

Post as a guest

Related

`UNIQUE CONSTRAINT` on `(song_id, "offset", hash)`

Hash index on `(hash)` and `CLUSTER`

`work_mem`