I'm using ClickHouse replication and plan to shard data across several shards/nodes. For the local replicated tables I want to use the AggregatingMergeTree engine, so the question is: should I use some specific sharding key for the distributed table that is built on top of these replicas? Can I use rand()?
Check these ones: stackoverflow.com/questions/66296329/… and stackoverflow.com/questions/61743180/…. – vladimir, Nov 6, 2023 at 23:58
Does this answer your question? What is the best way to choose shard key in clickhouse? – vladimir, Nov 7, 2023 at 0:03
The question is not what the best candidate for a sharding key is; the question is whether I need some specific key to shard the data at all. For example, for ReplacingMergeTree to work properly, all rows with the same combination of values defining a row version need to be stored on one shard. Should I do the same with AggregatingMergeTree? Or is it enough to use rand() and not worry about the result of mergeState, for example? Anyway, thank you for the links; they are worth examining, but right now they would probably just make a mess in my head. – Alexandr, Nov 7, 2023 at 0:23
1 Answer
Your question mentions AggregatingMergeTree, which probably works fine with rand(), but then your comment brings up deduplication, which is a different matter.
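To illustrate why rand() is probably fine here, below is a rough Python sketch (a simulation, not ClickHouse code). It assumes you query through the Distributed table with GROUP BY and the -Merge combinators, so the partial states from each shard get combined at query time. The helper names and sample rows are made up for the example.

```python
import random
from collections import defaultdict

def shard_rows(rows, n_shards, key_fn):
    """Place each row on shard key_fn(row) % n_shards."""
    shards = [[] for _ in range(n_shards)]
    for row in rows:
        shards[key_fn(row) % n_shards].append(row)
    return shards

def local_states(shard):
    """Per-shard partial states, key -> (sum, count), loosely like sumState/countState."""
    states = defaultdict(lambda: (0, 0))
    for key, value in shard:
        s, c = states[key]
        states[key] = (s + value, c + 1)
    return dict(states)

def merge_states(per_shard_states):
    """Combine the partial states of all shards, loosely like sumMerge/countMerge with GROUP BY."""
    merged = defaultdict(lambda: (0, 0))
    for states in per_shard_states:
        for key, (s, c) in states.items():
            ms, mc = merged[key]
            merged[key] = (ms + s, mc + c)
    return dict(merged)

# Hypothetical sample data: (grouping key, value)
rows = [("a", 1), ("b", 10), ("a", 2), ("b", 20), ("a", 3)]

# Random placement stands in for a rand() sharding key.
rng = random.Random(0)
random_shards = shard_rows(rows, 3, lambda row: rng.randrange(2**32))

result = merge_states(local_states(s) for s in random_shards)

# The totals do not depend on which shard each row landed on.
assert result == {"a": (6, 3), "b": (30, 2)}
```

The same key may end up with a partial state on more than one shard, but because those states are merged when you query, the final aggregate comes out the same whatever the placement.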
If you are using ReplacingMergeTree, then it can make sense to pick a sharding key that puts rows with the same sorting key (the ORDER BY key, which is what ReplacingMergeTree deduplicates on) onto the same shard. Merges only happen within a shard, so older versions of a row can eventually be deleted only if they sit on the same shard as the newer one. You will still need to use FINAL on your queries, or better yet structure your query logic to avoid the duplicate rows.
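Here is a small Python sketch of that difference (again a simulation, not ClickHouse code). It assumes deduplication only ever sees rows inside one shard. The round-robin placement stands in for an unlucky rand() outcome, and `crc32` stands in for any deterministic hash of the key. All names and rows are hypothetical.

```python
import zlib

def shard_rows(rows, n_shards, shard_fn):
    """Place each row on shard shard_fn(index, row) % n_shards."""
    shards = [[] for _ in range(n_shards)]
    for index, row in enumerate(rows):
        shards[shard_fn(index, row) % n_shards].append(row)
    return shards

def replace_within_shard(shard):
    """Keep only the highest version per key, looking at a single shard."""
    latest = {}
    for key, version, payload in shard:
        if key not in latest or version >= latest[key][1]:
            latest[key] = (key, version, payload)
    return list(latest.values())

def surviving_rows(shards):
    """Rows left across the cluster after every shard deduplicates locally."""
    return [row for shard in shards for row in replace_within_shard(shard)]

# Hypothetical sample data: (key, version, payload)
rows = [("user1", 1, "old"), ("user1", 2, "new"), ("user2", 1, "only")]

# Placement that ignores the key: both versions of user1 end up on different shards.
scattered = shard_rows(rows, 2, lambda index, row: index)

# Placement by a hash of the key: all versions of a key share one shard.
by_key = shard_rows(rows, 2, lambda index, row: zlib.crc32(row[0].encode()))

scattered_left = surviving_rows(scattered)
by_key_left = surviving_rows(by_key)

# The stale ("user1", 1, "old") row survives only when versions are scattered.
assert ("user1", 1, "old") in scattered_left
assert ("user1", 1, "old") not in by_key_left
```

With the key-based placement only the latest version of each key is left. With the scattered placement the old row can never be removed, because no single shard ever sees both versions.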
2 Comments
Alexandr
I just thought that AggregatingMergeTree uses the same strategy as ReplacingMergeTree when deduplicating/aggregating. So it is good that I can use AggregatingMergeTree with a rand() sharding key; I don't need any specific distribution of data across the nodes. Thank you.
Rich Raposa
You're right - the logic is similar. You could probably use a clever sharding key for that table as well. Without any details of what you've done and what your data looks like, it's hard to say what the best route is.