I have two database servers that are both in "test" mode, with one expected to be promoted to the production server. As such, the specs and some of the configs differ, but we are finding that the underpowered server produces better query plans and thus faster queries.
Stats:
Both systems hold roughly the same amount of data, which breaks down like this:
Size    | Part
--------|----------
1.47 TB | Entire DB
871 GB  | Tables
635 GB  | Indexes
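(For reference, these sizes can be reproduced with the standard size functions; the query below is a sketch that assumes user tables live outside the system schemas.)

    -- Database-wide total.
    SELECT pg_size_pretty(pg_database_size(current_database())) AS entire_db;

    -- Table vs. index totals, excluding system schemas.
    SELECT pg_size_pretty(sum(pg_table_size(c.oid)))   AS tables,
           pg_size_pretty(sum(pg_indexes_size(c.oid))) AS indexes
    FROM   pg_class c
    JOIN   pg_namespace n ON n.oid = c.relnamespace
    WHERE  c.relkind = 'r'
      AND  n.nspname NOT IN ('pg_catalog', 'information_schema');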
The bigger DB server has the following specs:
RAM: 500 GB
CPU: 16 Cores 2.0 GHz Intel
Using SSDs
Postgres 10.0
Memlock set to reserve 485 GB specifically for Postgres
Postgres Settings:
shared_buffers: 125 GB
work_mem: 36 MB
effective_cache_size: 300 GB
random_page_cost: 1
default_statistics_target: 1000
Query Plan: https://explain.depesz.com/s/9Ww6
The smaller server has the following specs:
RAM: 281 GB
CPU: 4 Cores 2.0 GHz Intel
Using SSDs
Postgres 10.0
Memlock set to reserve 240 GB specifically for Postgres
Postgres Settings:
shared_buffers: 50 GB
work_mem: 25.6 MB
effective_cache_size: 150 GB
random_page_cost: 4
default_statistics_target: 100
Query Plan: https://explain.depesz.com/s/4WUH
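(In case it matters, here is how the effective values can be confirmed on each box, so we are comparing what the planner actually sees rather than what is written in postgresql.conf:)

    -- Run on both servers; pg_settings shows the live value and its source.
    SELECT name, setting, unit, source
    FROM   pg_settings
    WHERE  name IN ('shared_buffers', 'work_mem', 'effective_cache_size',
                    'random_page_cost', 'default_statistics_target');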
We've tried matching random_page_cost, default_statistics_target (followed by ANALYZE), and work_mem between the two servers. The biggest gain came from running a VACUUM FULL on all tables involved in the query. A sketch of what that amounts to is below.
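(Roughly along these lines, shown with session-level SET for testing; some_table is a placeholder for each table in the query:)

    -- Bring the larger server's planner settings in line with the smaller one.
    SET random_page_cost = 4;
    SET default_statistics_target = 100;
    SET work_mem = '26214kB';          -- ~25.6 MB, matching the smaller server

    ANALYZE;                           -- rebuild statistics with the new target
    VACUUM (FULL, ANALYZE) some_table; -- repeated per table; this gave the biggest gain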
Workload: This machine is a read replica used to extract the data as XML files, etc., so it receives replicated data and serves a fairly heavy read load.
Question: What should I be looking for to make this query more performant on the larger server, where it currently runs slower? Ideally it would run much faster there than on the smaller server. It appears that as we have scaled, we have failed to adjust the settings to take advantage of our hardware. There must be something I'm overlooking.
Edit: I have put up the non-obfuscated plans. I have also tried bumping default_statistics_target from 1000 to 3000, and it didn't change the plans. The same goes for changing random_page_cost to match between the servers.
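(If it's relevant, I'm also aware the target can be raised per column, on just the columns the planner misestimates, rather than globally; a minimal sketch with hypothetical table/column names:)

    -- Hypothetical names; raise the per-column target, then re-analyze that table.
    ALTER TABLE some_table ALTER COLUMN some_column SET STATISTICS 3000;
    ANALYZE some_table;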