2

I know that the distributed node does not combine intermediate results from shards by using distributed_group_by_no_merge.

The following SQL

select sum(xxxxx),xxxxx from (
    select sum(xxxx),xxxx 
    from (
        select count(xxx),xxx 
        from distributed_table group by xxx )  
    group by xxxx SETTINGS distributed_group_by_no_merge = 1
) group by xxxxx

I want to know that which part of sql will be sent to MergeTree node to execute by using distributed_group_by_no_merge? is it?select count(xxx),xxx from distributed_table group by xxx ) group by xxxx SETTINGS distributed_group_by_no_merge = 1

how does the parameter of distributed_group_by_no_merge change the behavior of distributed query?which part of sql execute on MergeTree node and which part of sql execute on distributed node?

1 Answer 1

4

distributed_group_by_no_merge-param affects the way how the initiator-node (it is a node which runs distributed query) will form the final result of a distributed query:

  • either by merging aggregated intermediate states coming from shards by itself (it required copying full aggregated intermediate states from shards to initiator-node) [distributed_group_by_no_merge = 0 (default mode)]

  • or get already final result from shards (when each shard merges an intermediate aggregation state on its side and send to initiator-node only the final result). It provides a significant improvement in performance and resource consumption but requires the right selection of the sharding key [distributed_group_by_no_merge = 1]


I would put distributed_group_by_no_merge at the same level of subquery where defined distributed table to explicitly define your intention and avoid confusion when there are several distributed-subqueries.


Let's look at the way how to check the differences between the two modes (will use _shard_num-virtual column):

  1. distributed_group_by_no_merge=0
SELECT
    groupUniqArray(_shard_num) AS shards,
    ..
FROM table
WHERE ..
GROUP BY ..
SETTINGS distributed_group_by_no_merge = 0

/* Aggregated states were merged into ONE result set on initiator-node.
┌─shards────┬─ ..
│ [2, 1, 3] │  ..
└───────────┴─ ..
*/
  1. distributed_group_by_no_merge=1
SELECT
    groupUniqArray(_shard_num) AS shards,
    ..
FROM table
WHERE ..
GROUP BY ..
SETTINGS distributed_group_by_no_merge = 1

/* Get a set of final results (not aggregated states) from each shard. They should be unioned manually.
┌─shards─┬─ ..
│ [2]    │  ..
│ [1]    │  ..
│ [3]    │  ..
└────────┴─ ..
*/

See https://clickhouse.com/docs/en/operations/settings/settings#distributed_group_by_no_merge.

How to avoid merging high cardinality sub-select aggregations on distributed tables

Sign up to request clarification or add additional context in comments.

2 Comments

1.when not using distributed_group_by_no_merge, subquery only internal part of a query from distributed_table will be executed on shads, all other parts outside ( ) will be executed at an initiator node. why not array join execute on mergetree node 2.when using distributed_group_by_no_merge=1 at the same level of subquery, the-same-level subquery will be executed on shards, and the outside aggregate query will be executed on distributed node. right?
1) I updated my answer, please look at it. 2) yes, 'outside aggregates' will be run on initial-node independently from value of distributed_group_by_no_merge-param. The key difference is where be calculated the final result of distributed query - on initiator-node or each shard separately (understanding the intermediate aggregation state and merging of intermediate aggregation states help to clarify this topic).

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.