@@ -303,25 +303,33 @@ Oversized-Attribute Storage Technique).
303303
304304<para>
305305<productname>PostgreSQL</productname> uses a fixed page size (commonly
306- 8 kB), and does not allow tuples to span multiple pages. Therefore, it is
306+ 8 kB), and does not allow tuples to span multiple pages. Therefore, it is
307307not possible to store very large field values directly. To overcome
308- this limitation, large field values are compressed and/or broken up into
309- multiple physical rows. This happens transparently to the user, with only
308+ this limitation, large field values are compressed and/or broken up into
309+ multiple physical rows. This happens transparently to the user, with only
310310small impact on most of the backend code. The technique is affectionately
311- known as <acronym>TOAST</> (or <quote>the best thing since sliced bread</>).
311+ known as <acronym>TOAST</> (or <quote>the best thing since sliced bread</>).
312+ The <acronym>TOAST</> infrastructure is also used to improve handling of
313+ large data values in-memory.
312314</para>
313315
314316<para>
315317Only certain data types support <acronym>TOAST</> — there is no need to
316318impose the overhead on data types that cannot produce large field values.
317319To support <acronym>TOAST</>, a data type must have a variable-length
318- (<firstterm>varlena</>) representation, in which the first 32-bit word of any
319- stored value contains the total length of the value in bytes (including
320- itself). <acronym>TOAST</> does not constrain the rest of the representation.
321- All the C-level functions supporting a <acronym>TOAST</>-able data type must
322- be careful to handle <acronym>TOAST</>ed input values. (This is normally done
323- by invoking <function>PG_DETOAST_DATUM</> before doing anything with an input
324- value, but in some cases more efficient approaches are possible.)
320+ (<firstterm>varlena</>) representation, in which, ordinarily, the first
321+ four-byte word of any stored value contains the total length of the value in
322+ bytes (including itself). <acronym>TOAST</> does not constrain the rest
323+ of the data type's representation. The special representations collectively
324+ called <firstterm><acronym>TOAST</>ed values</firstterm> work by modifying or
325+ reinterpreting this initial length word. Therefore, the C-level functions
326+ supporting a <acronym>TOAST</>-able data type must be careful about how they
327+ handle potentially <acronym>TOAST</>ed input values: an input might not
328+ actually consist of a four-byte length word and contents until after it's
329+ been <firstterm>detoasted</>. (This is normally done by invoking
330+ <function>PG_DETOAST_DATUM</> before doing anything with an input value,
331+ but in some cases more efficient approaches are possible.
332+ See <xref linkend="xtypes-toast"> for more detail.)
325333</para>
326334
327335<para>
@@ -333,58 +341,84 @@ the value is an ordinary un-<acronym>TOAST</>ed value of the data type, and
333341the remaining bits of the length word give the total datum size (including
334342length word) in bytes. When the highest-order or lowest-order bit is set,
335343the value has only a single-byte header instead of the normal four-byte
336- header, and the remaining bits give the total datum size (including length
337- byte) in bytes. As a special case, if the remaining bits are all zero
338- (which would be impossible for a self-inclusive length), the value is a
339- pointer to out-of-line data stored in a separate TOAST table. (The size of
340- a TOAST pointer is given in the second byte of the datum.)
341- Values with single-byte headers aren't aligned on any particular
342- boundary, either. Lastly, when the highest-order or lowest-order bit is
343- clear but the adjacent bit is set, the content of the datum has been
344- compressed and must be decompressed before use. In this case the remaining
345- bits of the length word give the total size of the compressed datum, not the
344+ header, and the remaining bits of that byte give the total datum size
345+ (including length byte) in bytes. This alternative supports space-efficient
346+ storage of values shorter than 127 bytes, while still allowing the data type
347+ to grow to 1 GB at need. Values with single-byte headers aren't aligned on
348+ any particular boundary, whereas values with four-byte headers are aligned on
349+ at least a four-byte boundary; this omission of alignment padding provides
350+ additional space savings that are significant relative to short values.
351+ As a special case, if the remaining bits of a single-byte header are all
352+ zero (which would be impossible for a self-inclusive length), the value is
353+ a pointer to out-of-line data, with several possible alternatives as
354+ described below. The type and size of such a <firstterm>TOAST pointer</>
355+ are determined by a code stored in the second byte of the datum.
356+ Lastly, when the highest-order or lowest-order bit is clear but the adjacent
357+ bit is set, the content of the datum has been compressed and must be
358+ decompressed before use. In this case the remaining bits of the four-byte
359+ length word give the total size of the compressed datum, not the
346360original data. Note that compression is also possible for out-of-line data
347361but the varlena header does not tell whether it has occurred —
348- the content of the TOAST pointer tells that, instead.
362+ the content of the <acronym>TOAST</> pointer tells that, instead.
349363</para>
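
The header rules in the paragraph above can be sketched in C. This is a minimal illustration, not PostgreSQL's actual header macros: it assumes the layout in which the flag bits are the highest-order bits of the first byte (the text notes the lowest-order bits are used on other platforms), and the names HeaderKind, HeaderInfo, and classify_header are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef enum {
    HDR_4B_PLAIN,       /* four-byte header, ordinary value */
    HDR_4B_COMPRESSED,  /* four-byte header, compressed content */
    HDR_1B_SHORT,       /* single-byte header, short value */
    HDR_1B_POINTER      /* single-byte header, TOAST pointer */
} HeaderKind;

typedef struct {
    HeaderKind kind;
    uint32_t   size;    /* total datum size in bytes; 0 for pointers */
} HeaderInfo;

/*
 * Classify a datum from its leading bytes, assuming the flag bits are
 * the highest-order bits of the first byte and the four-byte length
 * word is stored most significant byte first.
 */
static HeaderInfo classify_header(const unsigned char *p)
{
    HeaderInfo info;

    assert(p != NULL);

    if (p[0] & 0x80) {
        /* Single-byte header: the remaining 7 bits hold the size. */
        uint32_t len = p[0] & 0x7F;

        if (len == 0) {
            /* All-zero size bits mark a TOAST pointer; its type and
             * size come from a code in the second byte. */
            info.kind = HDR_1B_POINTER;
            info.size = 0;
        } else {
            info.kind = HDR_1B_SHORT;
            info.size = len;
        }
        return info;
    }

    /* Four-byte header: the adjacent bit flags compression, and the
     * remaining 30 bits hold the (possibly compressed) datum size. */
    uint32_t word = ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
                    ((uint32_t) p[2] << 8)  |  (uint32_t) p[3];

    info.kind = (p[0] & 0x40) ? HDR_4B_COMPRESSED : HDR_4B_PLAIN;
    info.size = word & 0x3FFFFFFF;
    return info;
}
```

Real code supporting a TOAST-able type would normally rely on PG_DETOAST_DATUM rather than inspecting these bits directly; the sketch only makes the bit layout concrete.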
350364
351365<para>
352- If any of the columns of a table are <acronym>TOAST</>-able, the table will
353- have an associated <acronym>TOAST</> table, whose OID is stored in the table's
354- <structname>pg_class</>.<structfield>reltoastrelid</> entry. Out-of-line
355- <acronym>TOAST</>ed values are kept in the <acronym>TOAST</> table, as
356- described in more detail below.
366+ As mentioned, there are multiple types of <acronym>TOAST</> pointer datums.
367+ The oldest and most common type is a pointer to out-of-line data stored in
368+ a <firstterm><acronym>TOAST</> table</firstterm> that is separate from, but
369+ associated with, the table containing the <acronym>TOAST</> pointer datum
370+ itself. These <firstterm>on-disk</> pointer datums are created by the
371+ <acronym>TOAST</> management code (in <filename>access/heap/tuptoaster.c</>)
372+ when a tuple to be stored on disk is too large to be stored as-is.
373+ Further details appear in <xref linkend="storage-toast-ondisk">.
374+ Alternatively, a <acronym>TOAST</> pointer datum can contain a pointer to
375+ out-of-line data that appears elsewhere in memory. Such datums are
376+ necessarily short-lived, and will never appear on-disk, but they are very
377+ useful for avoiding copying and redundant processing of large data values.
378+ Further details appear in <xref linkend="storage-toast-inmemory">.
357379</para>
358380
359381<para>
360- The compression technique used is a fairly simple and very fast member
382+ The compression technique used for either in-line or out-of-line compressed
383+ data is a fairly simple and very fast member
361384of the LZ family of compression techniques. See
362385<filename>src/common/pg_lzcompress.c</> for the details.
363386</para>
364387
388+ <sect2 id="storage-toast-ondisk">
389+ <title>Out-of-line, on-disk TOAST storage</title>
390+
391+ <para>
392+ If any of the columns of a table are <acronym>TOAST</>-able, the table will
393+ have an associated <acronym>TOAST</> table, whose OID is stored in the table's
394+ <structname>pg_class</>.<structfield>reltoastrelid</> entry. On-disk
395+ <acronym>TOAST</>ed values are kept in the <acronym>TOAST</> table, as
396+ described in more detail below.
397+ </para>
398+
365399<para>
366400Out-of-line values are divided (after compression if used) into chunks of at
367401most <symbol>TOAST_MAX_CHUNK_SIZE</> bytes (by default this value is chosen
368402so that four chunk rows will fit on a page, making it about 2000 bytes).
369- Each chunk is stored
370- as a separate row in the <acronym>TOAST</> table for the owning table. Every
403+ Each chunk is stored as a separate row in the <acronym>TOAST</> table
404+ belonging to the owning table. Every
371405<acronym>TOAST</> table has the columns <structfield>chunk_id</> (an OID
372406identifying the particular <acronym>TOAST</>ed value),
373407<structfield>chunk_seq</> (a sequence number for the chunk within its value),
374408and <structfield>chunk_data</> (the actual data of the chunk). A unique index
375409on <structfield>chunk_id</> and <structfield>chunk_seq</> provides fast
376- retrieval of the values. A pointer datum representing an out-of-line
410+ retrieval of the values. A pointer datum representing an out-of-line on-disk
377411<acronym>TOAST</>ed value therefore needs to store the OID of the
378412<acronym>TOAST</> table in which to look and the OID of the specific value
379413(its <structfield>chunk_id</>). For convenience, pointer datums also store the
380- logical datum size (original uncompressed data length) and actual stored size
414+ logical datum size (original uncompressed data length) and physical stored size
381415(different if compression was applied). Allowing for the varlena header bytes,
382- the total size of a <acronym>TOAST</> pointer datum is therefore 18 bytes
383- regardless of the actual size of the represented value.
416+ the total size of an on-disk <acronym>TOAST</> pointer datum is therefore 18
417+ bytes regardless of the actual size of the represented value.
384418</para>
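
The chunking and pointer-size arithmetic described above can be sketched as follows. This is an illustration under stated assumptions: the chunk size is rounded to the "about 2000 bytes" mentioned in the text rather than taken from the real TOAST_MAX_CHUNK_SIZE, the four pointer fields are assumed to be four bytes each, and all names (SKETCH_CHUNK_SIZE, SketchOnDiskPointer, chunks_needed, last_chunk_size) are hypothetical.

```c
#include <stdint.h>

/* Assumed round figure; the real TOAST_MAX_CHUNK_SIZE is derived from
 * the page size so that four chunk rows fit on a page. */
#define SKETCH_CHUNK_SIZE 2000u

/* The four items an on-disk pointer datum stores, assumed to be
 * four bytes each. */
typedef struct {
    uint32_t raw_size;         /* logical (uncompressed) datum size */
    uint32_t stored_size;      /* physical stored size */
    uint32_t value_oid;        /* chunk_id of the value */
    uint32_t toast_table_oid;  /* TOAST table in which to look */
} SketchOnDiskPointer;

/* Varlena header bytes in front of the payload: the single-byte
 * header plus the type code in the second byte. */
#define SKETCH_POINTER_HEADER_BYTES 2u

/* Number of chunk rows needed for a value of the given stored size. */
static uint32_t chunks_needed(uint32_t stored_size)
{
    return (stored_size + SKETCH_CHUNK_SIZE - 1) / SKETCH_CHUNK_SIZE;
}

/* Size of the final chunk; all earlier chunks are full. */
static uint32_t last_chunk_size(uint32_t stored_size)
{
    uint32_t rem;

    if (stored_size == 0)
        return 0;
    rem = stored_size % SKETCH_CHUNK_SIZE;
    return rem ? rem : SKETCH_CHUNK_SIZE;
}
```

Under these assumptions a 5000-byte stored value occupies three chunk rows (2000, 2000, and 1000 bytes), and the pointer datum is 16 payload bytes plus 2 header bytes, matching the 18 bytes stated in the text.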
385419
386420<para>
387- The <acronym>TOAST</> code is triggered only
421+ The <acronym>TOAST</> management code is triggered only
388422when a row value to be stored in a table is wider than
389423<symbol>TOAST_TUPLE_THRESHOLD</> bytes (normally 2 kB).
390424The <acronym>TOAST</> code will compress and/or move
@@ -397,8 +431,8 @@ none of the out-of-line values change.
397431</para>
398432
399433<para>
400- The <acronym>TOAST</> code recognizes four different strategies for storing
401- <acronym>TOAST</>-able columns:
434+ The <acronym>TOAST</> management code recognizes four different strategies
435+ for storing <acronym>TOAST</>-able columns on disk:
402436
403437 <itemizedlist>
404438 <listitem>
@@ -460,6 +494,41 @@ pages). There was no run time difference compared to an un-<acronym>TOAST</>ed
460494comparison table, in which all the HTML pages were cut down to 7 kB to fit.
461495</para>
462496
497+ </sect2>
498+
499+ <sect2 id="storage-toast-inmemory">
500+ <title>Out-of-line, in-memory TOAST storage</title>
501+
502+ <para>
503+ <acronym>TOAST</> pointers can point to data that is not on disk, but is
504+ elsewhere in the memory of the current server process. Such pointers
505+ obviously cannot be long-lived, but they are nonetheless useful. There
506+ is currently just one sub-case:
507+ pointers to <firstterm>indirect</> data.
508+ </para>
509+
510+ <para>
511+ Indirect <acronym>TOAST</> pointers simply point at a non-indirect varlena
512+ value stored somewhere in memory. This case was originally created merely
513+ as a proof of concept, but it is currently used during logical decoding to
514+ avoid possibly having to create physical tuples exceeding 1 GB (as pulling
515+ all out-of-line field values into the tuple might do). The case is of
516+ limited use since the creator of the pointer datum is entirely responsible
517+ for ensuring that the referenced data survives for as long as the pointer could exist,
518+ and there is no infrastructure to help with this.
519+ </para>
520+
521+ <para>
522+ For all types of in-memory <acronym>TOAST</> pointer, the <acronym>TOAST</>
523+ management code ensures that no such pointer datum can accidentally get
524+ stored on disk. In-memory <acronym>TOAST</> pointers are automatically
525+ expanded to normal in-line varlena values before storage — and then
526+ possibly converted to on-disk <acronym>TOAST</> pointers, if the containing
527+ tuple would otherwise be too big.
528+ </para>
529+
530+ </sect2>
531+
463532</sect1>
464533
465534<sect1 id="storage-fsm">