
I have a local PG instance populated with tens of millions of rows of data, and I need to run some relatively complex queries for data analysis work.

These queries are currently taking 10+ minutes to return (even after adding custom indexes for every query I'm running).

While this may not be the "right tool for the job", I'm sure my system isn't fully utilizing its available resources. I've configured it using PGTune, but those settings still seem to build in a margin of safety for application stability, multiple connections, competing processes, etc.

If I just want PG to run as fast as it possibly can for my single connection...

What are the most important settings? And how should they be configured, relative to the system specs?

(Mine are 8 cores and 32 GB of RAM, for example.)

  • Disagree with the close reason. PG falls well into "software tools primarily used by programmers." Commented Sep 30, 2024 at 2:55
  • Right. The correct close reason is that you provided too little information. Commented Sep 30, 2024 at 6:02
  • 1
    "margin of safety for application stability" - if you're willing to sacrifice that, you can consider non-durable settings. Make sure you have a cold backup to go back to and take a look at the rest of performance tips and resource consumption settings. Commented Sep 30, 2024 at 15:25
  • 1
    The manual lists EXPLAIN first for a reason - even a very slight improvement to your schema and the queries in your pipeline can yield disproportionately better results (orders of magnitude) than multiplying your resources, lifting all resource consumption constraints and removing all safety measures. One advantage of those last 3 things is you don't have to read any of the code you're trying to speed up - which might be preferable when dealing with a ton of legacy code to go through. Still, even then, removing just a few most obvious bottlenecks might outweigh all config tweaks. Commented Sep 30, 2024 at 15:46
  • Could you please share the results from explain(analyze, verbose, buffers, settings) for your slow SQL statement, the statement itself and the DDL for all tables and indexes involved? All in plain text, as an update of your question. And for your information, there is no configuration setting "maximize compute resources for a single connection". The best config depends on your hardware, usage pattern and configuration. Commented Sep 30, 2024 at 16:50
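
For reference, the non-durable settings mentioned in the comments above are the ones the PostgreSQL documentation describes under "Non-Durable Settings". A rough sketch of what they look like in postgresql.conf (values illustrative; all of them trade crash safety for speed, so keep that cold backup):

    # postgresql.conf -- illustrative non-durable settings; a crash can mean data loss
    fsync = off                   # don't force WAL writes to disk
    synchronous_commit = off      # don't wait for the WAL flush before reporting commit
    full_page_writes = off        # only sensible once fsync is off anyway
    checkpoint_timeout = 30min    # fewer checkpoints, at the cost of longer crash recovery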

2 Answers


The default for max_parallel_workers_per_gather is 2, which is too low for the situation you describe. It should be set equal to the number of CPUs (actually one less than the number of CPUs, but that isn't likely to make any meaningful difference). But not all queries can benefit from parallel workers, so this might not make much difference to you.
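
For an 8-core machine, a session-level sketch might look like this (the values are illustrative, and the server-wide worker pool caps what a single query can actually get):

    -- session-level override for an 8-core machine (illustrative)
    SET max_parallel_workers_per_gather = 8;

    -- the workers come from a server-wide pool configured in postgresql.conf
    -- (changing max_worker_processes requires a restart):
    --   max_worker_processes = 8
    --   max_parallel_workers = 8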

High values for effective_io_concurrency can help if your queries involve bitmap heap scans and your I/O system benefits from having multiple I/O requests in flight at the same time (RAID/JBOD setups generally do; a good-quality SSD usually does, even as a single drive; even a single HDD can often see a small benefit).

Setting effective_cache_size to the same size as all of RAM (or just slightly less) can help for some queries.

Increasing work_mem can help. But be careful not to overdo it: even a single session can allocate many multiples of work_mem if the plan involves sorts in many different executor nodes, parallel workers, or partitions. That said, the "spill to disk" algorithms are now good enough that this often doesn't make a big difference.
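
Pulling those together, a rough set of per-session starting points for an 8-core / 32 GB machine might look like this (values illustrative, not tuned to any particular workload):

    -- illustrative per-session settings for 8 cores / 32 GB RAM
    SET effective_cache_size = '24GB';    -- planner hint only; it allocates no memory
    SET effective_io_concurrency = 200;   -- reasonable for SSDs; use a much lower value for a single HDD
    SET work_mem = '512MB';               -- per sort/hash node, so total usage can be several times this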


1 Comment

This comes the closest to answering the question as asked (how do I squeeze max performance out of my resources). Although after tweaking things and maxing stuff out, it's clear that this really only takes you so far and there is no escaping query optimization.

If your statements are slow, you have to find the cause with EXPLAIN (ANALYZE, BUFFERS, SETTINGS) and improve that. Twiddling parameters will achieve less than you think.

That said, the most important parameter for queries on big tables is work_mem. Set it as high as you can without going out of memory.
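
Putting both suggestions together, a minimal sketch might look like this (the query and the work_mem value are placeholders, not recommendations):

    -- raise work_mem for this session only (illustrative value)
    SET work_mem = '1GB';

    -- then profile the slow statement; "Sort Method: external merge" in the output
    -- means a sort still spilled to disk despite the higher work_mem
    EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
    SELECT customer_id, sum(amount)   -- hypothetical query, substitute your own
    FROM orders
    GROUP BY customer_id;

The SETTINGS option (available since PostgreSQL 12) lists any parameters that differ from their defaults, which makes it easy to confirm the overrides actually reached the session.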
