Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers
| From | Yura Sokolov |
|---|---|
| Subject | Re: BufferAlloc: don't take two simultaneous locks |
| Date | |
| Msg-id | 3222ea2f83d59fdb1c765cc8fc22271163e0a3dc.camel@postgrespro.ru |
| In response to | Re: BufferAlloc: don't take two simultaneous locks (Robert Haas <robertmhaas@gmail.com>) |
| Responses | Re: BufferAlloc: don't take two simultaneous locks; Re: BufferAlloc: don't take two simultaneous locks |
| List | pgsql-hackers |
On Thu, 2022-04-21 at 16:24 -0400, Robert Haas wrote:
> On Thu, Apr 21, 2022 at 5:04 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> > $ pid=`ps x | awk '/checkpointer/ && !/awk/ { print $1 }'`
> > $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> >
> > $1 = 16512
> >
> > $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
> > ...
> > $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> >
> > $1 = 20439
> >
> > $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
> > ...
> > $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> >
> > $1 = 20541
> >
> > It stabilizes at 20541
>
> Hmm. So is the existing comment incorrect?
It is correct and incorrect at the same time. Logically it is correct,
and it is correct in practice if HASH_FIXED_SIZE is set for SharedBufHash
(which it currently is not). But setting HASH_FIXED_SIZE hurts performance
when there are few spare items.
> Remember, I was complaining
> about this change:
>
> --- a/src/backend/storage/buffer/freelist.c
> +++ b/src/backend/storage/buffer/freelist.c
> @@ -481,10 +481,10 @@ StrategyInitialize(bool init)
> *
> * Since we can't tolerate running out of lookup table entries, we must be
> * sure to specify an adequate table size here. The maximum steady-state
> - * usage is of course NBuffers entries, but BufferAlloc() tries to insert
> - * a new entry before deleting the old. In principle this could be
> - * happening in each partition concurrently, so we could need as many as
> - * NBuffers + NUM_BUFFER_PARTITIONS entries.
> + * usage is of course NBuffers entries. But due to concurrent
> + * access to numerous free lists in dynahash we can miss free entry that
> + * moved between free lists. So it is better to have some spare free entries
> + * to reduce probability of entry allocations after server start.
> */
> InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
>
> Pre-patch, the comment claims that the maximum number of buffer
> entries that can be simultaneously used is limited to NBuffers +
> NUM_BUFFER_PARTITIONS, and that's why we make the hash table that
> size. The idea is that we normally need more than 1 entry per buffer,
> but sometimes we might have 2 entries for the same buffer if we're in
> the process of changing the buffer tag, because we make the new entry
> before removing the old one. To change the buffer tag, we need the
> buffer mapping lock for the old partition and the new one, but if both
> are the same, we need only one buffer mapping lock. That means that in
> the worst case, you could have a number of processes equal to
> NUM_BUFFER_PARTITIONS each in the process of changing the buffer tag
> between values that both fall into the same partition, and thus each
> using 2 entries. Then you could have every other buffer in use and
> thus using 1 entry, for a total of NBuffers + NUM_BUFFER_PARTITIONS
> entries. Now I think you're saying we go far beyond that number, and
> what I wonder is how that's possible. If the system doesn't work the
> way the comment says it does, maybe we ought to start by talking about
> what to do about that.
At the master state:
- SharedBufHash is not declared as HASH_FIXED_SIZE.
- get_hash_entry falls back to element_alloc too fast (as soon as it
  doesn't find a free entry in the current freelist partition).
- get_hash_entry has races.
- if the number of spare items is small (and NUM_BUFFER_PARTITIONS is a
  small number) and HASH_FIXED_SIZE is set, it becomes contended and
  therefore slow.
HASH_REUSE solves (for shared buffers) most of these issues. The free list
becomes a rare fallback, so HASH_FIXED_SIZE for SharedBufHash doesn't lead
to a performance hit. And with a fair number of spare items, get_hash_entry
will find a free entry despite its races.
> I am a bit confused by your description of having done "p
> SharedBufHash->hctl->allocated.value" because SharedBufHash is of type
> HTAB and HTAB's hctl member is of type HASHHDR, which has no field
> called "allocated".
The previous letter contains links to the small patches I used for
experiments. The link that adds "allocated" is https://pastebin.com/c5z0d5mz
> I thought maybe my analysis here was somehow
> mistaken, so I tried the debugger, which took the same view of it that
> I did:
>
> (lldb) p SharedBufHash->hctl->allocated.value
> error: <user expression 0>:1:22: no member named 'allocated' in 'HASHHDR'
> SharedBufHash->hctl->allocated.value
> ~~~~~~~~~~~~~~~~~~~ ^
-----
regards
Yura Sokolov