Context

I am using Postgres as a multi-tenant graph database.

All data is stored in a single table, and I rely on partial indexes to make queries efficient.

Here's how a minimal schema looks:

CREATE TABLE triples (
    app_id text NOT NULL,
    entity_id text NOT NULL,
    attr_id text NOT NULL,
    value jsonb NOT NULL,
    eav boolean NOT NULL DEFAULT false,
    ave boolean NOT NULL DEFAULT false,
    vae boolean NOT NULL DEFAULT false,
    created_at bigint NOT NULL DEFAULT 0,
    checked_data_type text
);

CREATE OR REPLACE FUNCTION triples_extract_number_value(value jsonb)
RETURNS double precision AS $$
BEGIN
  IF jsonb_typeof(value) = 'number' THEN
    RETURN value::double precision;
  ELSE
    RETURN NULL;
  END IF;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

CREATE INDEX vae_idx ON triples (app_id, value, attr_id, entity_id) WHERE vae;

CREATE INDEX ave_number_idx ON triples (app_id, attr_id, triples_extract_number_value(value), entity_id) WHERE ave AND checked_data_type = 'number';

In an app with conversations, groups, and messages, data could be stored as:

{app_id: 'app_1', entity_id: 'convo_1', attr_id: 'title', value: "My Conversation"}
{app_id: 'app_1', entity_id: 'group_1', attr_id: 'title', value: "My Group"}
{app_id: 'app_1', entity_id: 'convo_1', attr_id: 'group', value: "group_1", vae: true} // this links convos to groups
{app_id: 'app_1', entity_id: 'msg_1', attr_id: 'time', value: 123, ave: true, checked_data_type: 'number'} // this indexes our 'time' field
{app_id: 'app_1', entity_id: 'msg_1', attr_id: 'convo', value: "convo_1", vae: true} // this links `msg_1` to `convo_1`
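
As a rough sketch, the sample triples could be loaded with an INSERT like the one below. It assumes link values are stored as JSON strings (which is what the '"group_1"' literal and the to_jsonb(...) comparison in the query further down would match) and that 'time' is stored as a JSON number, so that triples_extract_number_value returns a non-NULL result. Those encodings are assumptions, not something the schema itself confirms.

```sql
-- Sample rows for the triples table defined above.
-- Assumption: link values are JSON strings, numeric values are JSON numbers.
INSERT INTO triples (app_id, entity_id, attr_id, value, vae, ave, checked_data_type)
VALUES
    ('app_1', 'convo_1', 'title', '"My Conversation"', false, false, NULL),
    ('app_1', 'group_1', 'title', '"My Group"',        false, false, NULL),
    ('app_1', 'convo_1', 'group', '"group_1"',         true,  false, NULL),  -- links convo_1 to group_1
    ('app_1', 'msg_1',   'time',  '123',               false, true,  'number'),  -- indexed numeric value
    ('app_1', 'msg_1',   'convo', '"convo_1"',         true,  false, NULL);  -- links msg_1 to convo_1
```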

Goal

I want to satisfy the following query:

"Give me all convos that belong to group 'group_1' and have a message whose time is greater than or equal to 5."

To do this, I wrote the following query, which joins three aliases of the triples table:

SELECT 
    DISTINCT(match_0_0.entity_id)
FROM 
    triples AS match_0_0
JOIN 
    triples AS match_0_1
    ON match_0_1.app_id = match_0_0.app_id
    AND match_0_1.vae = true
    AND match_0_1.attr_id = 'convo'
    AND match_0_1.value = to_jsonb(match_0_0.entity_id)
JOIN 
    triples AS match_0_2
    ON match_0_2.app_id = match_0_1.app_id
    AND match_0_2.ave = true
    AND match_0_2.attr_id = 'time'
    AND triples_extract_number_value(match_0_2.value) >= 5
    AND match_0_2.checked_data_type = 'number'
    AND match_0_2.entity_id = match_0_1.entity_id
WHERE 
    match_0_0.app_id = 'chat_app'
    AND match_0_0.vae = true
    AND match_0_0.attr_id = 'groups'
    AND match_0_0.value = '"group_1"';

Problem

The problem is that this query takes about 14 seconds to run. EXPLAIN (ANALYZE, BUFFERS) shows:

QUERY PLAN
Unique  (cost=1.10..51.92 rows=1 width=8) (actual time=0.588..14786.024 rows=150 loops=1)
  Buffers: shared hit=523495
  ->  Nested Loop  (cost=1.10..51.92 rows=1 width=8) (actual time=0.588..14784.589 rows=5996 loops=1)
        Buffers: shared hit=523495
        ->  Nested Loop  (cost=0.82..37.31 rows=4 width=25) (actual time=0.048..29.058 rows=12000 loops=1)
              Buffers: shared hit=1499
              ->  Index Only Scan using vae_idx on triples match_0_0  (cost=0.41..11.98 rows=3 width=17) (actual time=0.029..0.533 rows=300 loops=1)
                    Index Cond: ((app_id = 'chat_app'::text) AND (value = '"group_1"'::jsonb) AND (attr_id = 'groups'::text))
                    Heap Fetches: 300
                    Buffers: shared hit=55
              ->  Index Only Scan using vae_idx on triples match_0_1  (cost=0.41..8.44 rows=1 width=34) (actual time=0.016..0.079 rows=40 loops=300)
                    Index Cond: ((app_id = 'chat_app'::text) AND (value = to_jsonb(match_0_0.entity_id)) AND (attr_id = 'convo'::text))
                    Heap Fetches: 12000
                    Buffers: shared hit=1444
        ->  Index Scan using ave_number_idx on triples match_0_2  (cost=0.28..3.64 rows=1 width=17) (actual time=0.929..1.229 rows=0 loops=12000)
              Index Cond: ((app_id = 'chat_app'::text) AND (attr_id = 'time'::text) AND (triples_extract_number_value(value) >= '5'::double precision) AND (entity_id = match_0_1.entity_id))
              Buffers: shared hit=521996
Planning:
  Buffers: shared hit=109
Planning Time: 0.477 ms
Execution Time: 14786.197 ms

Nested Loop

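Looking at the EXPLAIN (ANALYZE, BUFFERS) output, I noticed that the outer nested loop join accounts for almost all of the buffer hits: the index scan on ave_number_idx runs 12,000 times at roughly 1.2 ms each and is responsible for 521,996 of the 523,495 shared buffer hits.

If I disable nested loop joins, the query finishes in about 40 ms: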

SET enable_nestloop TO off;

SELECT 
    DISTINCT(match_0_0.entity_id)
FROM 
    triples AS match_0_0
JOIN 
    triples AS match_0_1
    ON match_0_1.app_id = match_0_0.app_id
    AND match_0_1.vae = true
    AND match_0_1.attr_id = 'convo'
    AND match_0_1.value = to_jsonb(match_0_0.entity_id)
JOIN 
    triples AS match_0_2
    ON match_0_2.app_id = match_0_1.app_id
    AND match_0_2.ave = true
    AND match_0_2.attr_id = 'time'
    AND triples_extract_number_value(match_0_2.value) >= 5
    AND match_0_2.checked_data_type = 'number'
    AND match_0_2.entity_id = match_0_1.entity_id
WHERE 
    match_0_0.app_id = 'chat_app'
    AND match_0_0.vae = true
    AND match_0_0.attr_id = 'groups'
    AND match_0_0.value = '"group_1"';
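
A plain SET stays in effect for the rest of the session, so it also affects every later query on that connection. A minimal sketch of scoping the override to a single transaction with SET LOCAL is shown below; wrapping the query this way is an illustration, not something the original setup is known to do. Note that enable_nestloop = off only discourages nested loops heavily, it does not forbid them.

```sql
BEGIN;

-- SET LOCAL reverts automatically at COMMIT or ROLLBACK,
-- so other queries on this connection keep the default planner settings.
SET LOCAL enable_nestloop = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT match_0_0.entity_id
FROM triples AS match_0_0
JOIN triples AS match_0_1
    ON match_0_1.app_id = match_0_0.app_id
    AND match_0_1.vae = true
    AND match_0_1.attr_id = 'convo'
    AND match_0_1.value = to_jsonb(match_0_0.entity_id)
JOIN triples AS match_0_2
    ON match_0_2.app_id = match_0_1.app_id
    AND match_0_2.ave = true
    AND match_0_2.attr_id = 'time'
    AND triples_extract_number_value(match_0_2.value) >= 5
    AND match_0_2.checked_data_type = 'number'
    AND match_0_2.entity_id = match_0_1.entity_id
WHERE match_0_0.app_id = 'chat_app'
    AND match_0_0.vae = true
    AND match_0_0.attr_id = 'groups'
    AND match_0_0.value = '"group_1"';

COMMIT;
```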

Question

Is there a way to hint to Postgres that it should choose a better join strategy for this query?

Repro

I set up a repro on DB Fiddle, which shows the slow query:

https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/15620

1 Answer

We found a few ways to steer Postgres away from the nested loop join here.

Option 1: Make nested loops more expensive

If we adjusted some planner cost parameters:

SET random_page_cost = 2.0;
SET cpu_index_tuple_cost = 0.05;
SET cpu_operator_cost = 0.0001;

Then Postgres would use a hash join for this query.

However, when we backtested these settings against our other queries, overall performance got worse.
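
When experimenting with settings like these, it can help to see how far each one has moved from its built-in default and to undo the change afterwards. The sketch below uses the standard pg_settings view and RESET; it is an illustration of how such an experiment might be inspected and rolled back, not part of the original backtest.

```sql
-- Compare the current value of each planner cost parameter
-- with its compiled-in default (boot_val).
SELECT name, setting, boot_val
FROM pg_settings
WHERE name IN ('random_page_cost', 'cpu_index_tuple_cost', 'cpu_operator_cost');

-- Undo the session-level overrides from Option 1.
RESET random_page_cost;
RESET cpu_index_tuple_cost;
RESET cpu_operator_cost;
```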

Option 2: Add an index where entity_id comes before value

If we created an index where entity_id came before value, Postgres would switch to a hash join and use the new index:

CREATE INDEX aev_number_idx ON triples (app_id, attr_id, entity_id, triples_extract_number_value(value)) WHERE ave AND checked_data_type = 'number';

However, this didn't feel right to us.

We couldn't drop ave_number_idx, because there are scenarios where we look up entity_id by value.

That means we would have to duplicate the indexed data in this new index.
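
To put a number on that duplication, the size of each index on the table can be read from the statistics views. This is a small sketch using pg_stat_user_indexes and pg_relation_size; it only reports sizes and scan counts, and the column aliases are illustrative.

```sql
-- List every index on triples with its on-disk size and how often it has been scanned.
SELECT indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan AS scans_since_stats_reset
FROM pg_stat_user_indexes
WHERE relname = 'triples'
ORDER BY pg_relation_size(indexrelid) DESC;
```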

Option 3: Remove entity_id from ave_number_idx

The final option was to remove entity_id from ave_number_idx:

CREATE INDEX ave_number_idx_no_e ON triples (app_id, attr_id, triples_extract_number_value(value)) WHERE ave AND checked_data_type = 'number';

This forced Postgres to do hash joins. Running a backtest, we didn't see any queries get slower.

We ended up choosing Option 3.
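
One plausible reading of the slow plan is that, because the range condition on the value expression comes before entity_id in ave_number_idx, each of the 12,000 probes had to walk many index entries while the planner costed it as a cheap lookup. On a live table, the swap could be done without blocking writes, roughly as sketched below. This is a variant of the statement above, and dropping the old index afterwards is an assumption about the rollout rather than something stated here.

```sql
-- Build the replacement index without blocking writes.
-- CONCURRENTLY cannot be used inside a transaction block.
CREATE INDEX CONCURRENTLY ave_number_idx_no_e
    ON triples (app_id, attr_id, triples_extract_number_value(value))
    WHERE ave AND checked_data_type = 'number';

-- Refresh planner statistics, including those for the index expression.
ANALYZE triples;

-- Assumption: the old index is dropped only after EXPLAIN confirms the new plan.
DROP INDEX CONCURRENTLY ave_number_idx;
```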

More alternatives

One method we tried was to rewrite the query using materialized CTEs. However, this only worked sporadically: some ways of writing the CTEs produced a hash join, and others did not.
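
For illustration, one possible shape of such a rewrite is sketched below. It is not necessarily the form we tried, the CTE names are made up, and as noted it is not guaranteed to produce a hash join. AS MATERIALIZED requires PostgreSQL 12 or later.

```sql
WITH group_convos AS MATERIALIZED (
    -- Convos linked to the target group.
    SELECT entity_id
    FROM triples
    WHERE app_id = 'chat_app'
      AND vae = true
      AND attr_id = 'groups'
      AND value = '"group_1"'
),
timed_msgs AS MATERIALIZED (
    -- Messages whose numeric 'time' value is at least 5.
    SELECT entity_id
    FROM triples
    WHERE app_id = 'chat_app'
      AND ave = true
      AND attr_id = 'time'
      AND checked_data_type = 'number'
      AND triples_extract_number_value(value) >= 5
)
SELECT DISTINCT c.entity_id
FROM group_convos AS c
JOIN triples AS m
    ON m.app_id = 'chat_app'
    AND m.vae = true
    AND m.attr_id = 'convo'
    AND m.value = to_jsonb(c.entity_id)
JOIN timed_msgs AS t
    ON t.entity_id = m.entity_id;
```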

We also tried extended statistics, but were not able to create a statistics object that told Postgres this nested loop join was a bad idea.
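
For readers unfamiliar with the feature, an attempt of this kind looks roughly like the sketch below. The statistics name and the choice of columns are hypothetical, and an object like this did not change the plan for us.

```sql
-- Hypothetical statistics object; the name and column choice are assumptions.
CREATE STATISTICS triples_attr_type_stats (ndistinct, dependencies)
    ON app_id, attr_id, checked_data_type
    FROM triples;

-- Extended statistics are only populated by ANALYZE.
ANALYZE triples;

-- Inspect what was collected (the pg_stats_ext view exists in PostgreSQL 12 and later).
SELECT statistics_name, attnames, n_distinct, dependencies
FROM pg_stats_ext
WHERE tablename = 'triples';
```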

  • Hi Stepan, you are allowed to tick your own answer after 24 hrs. Commented Dec 11, 2024 at 3:38
