I'm using ClickHouse replication and plan to shard data across several shards/nodes. For the local replicated table I want to use the AggregatingMergeTree engine, so my question is: should the Distributed table built on top of these replicas use some specific sharding key? Can I use rand()?

3 Comments
  • Check these ones: stackoverflow.com/questions/66296329/… and stackoverflow.com/questions/61743180/…. Commented Nov 6, 2023 at 23:58
  • Does this answer your question? what is the best way to choose shard key in clickhouse? Commented Nov 7, 2023 at 0:03
  • The question is not what the best candidate for a sharding key is; the question is whether I need a specific key to shard the data at all. For example, for ReplacingMergeTree to work properly, all rows with the same combination of values that defines a row version must be stored on one shard. Should I do the same with AggregatingMergeTree? Or is it enough to use rand() and not worry about the result of mergeState, for example? Anyway, thank you for the links. They are worth examining, but right now they would probably just make a mess in my head. Commented Nov 7, 2023 at 0:23

1 Answer

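Your question mentions AggregatingMergeTree, which should work fine with rand(): when you query through the Distributed table with GROUP BY and the -Merge combinators, the partial aggregate states from all shards are combined, so it does not matter which shard a given row landed on. But then you mentioned deduplication, which is a different case.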
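To make the rand() case concrete, here is a minimal Python sketch. It is a simulation, not ClickHouse code: every name in it is hypothetical, and the `(sum, count)` tuple is only a stand-in for an aggregate function state. It scatters rows over three shards at random, builds partial states per shard, and merges them at "query time".

```python
import random
from collections import defaultdict


def shard_rows(rows, num_shards, pick_shard):
    """Distribute rows over shards using the given sharding function."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[pick_shard(row) % num_shards].append(row)
    return shards


def partial_states(shard):
    """Per-shard partial state for each key: (sum, count)."""
    states = defaultdict(lambda: (0, 0))
    for key, value in shard:
        s, c = states[key]
        states[key] = (s + value, c + 1)
    return dict(states)


def merge_states(all_states):
    """Query-time merge of the partial states coming from every shard."""
    merged = defaultdict(lambda: (0, 0))
    for states in all_states:
        for key, (s, c) in states.items():
            ms, mc = merged[key]
            merged[key] = (ms + s, mc + c)
    return dict(merged)


rows = [("a", 1), ("b", 10), ("a", 2), ("c", 5), ("b", 20), ("a", 3)]

# Assumption: a seeded generator plays the role of rand() as the sharding key.
rng = random.Random(42)
random_shards = shard_rows(rows, 3, lambda row: rng.randrange(3))

result = merge_states(partial_states(s) for s in random_shards)
single_node = partial_states(rows)  # what one unsharded table would hold
print(result == single_node)
```

Because merging the states is associative and commutative, the merged result matches the single-node aggregation whichever shard each row went to.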

If you are using ReplacingMergeTree, then it can make sense to pick a sharding key that puts rows with the same sorting key (ORDER BY, which is also the primary key by default) onto the same shard. You will get better deduplication this way, because background merges run within a single shard and can only remove older rows that live next to their replacements. You will still need to use FINAL on your queries or, better yet, structure your query logic to avoid the duplicate rows.
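The difference for ReplacingMergeTree can be illustrated with another small Python simulation (again hypothetical names, not ClickHouse code). Assumptions: `collapse` models a merge that keeps the highest version per key inside one shard, `zlib.crc32` stands in for a hash-based sharding key, and round-robin placement stands in for a sharding key unrelated to the row key.

```python
import zlib


def place(rows, num_shards, pick_shard):
    """Distribute rows over shards; pick_shard receives (index, row)."""
    shards = [[] for _ in range(num_shards)]
    for index, row in enumerate(rows):
        shards[pick_shard(index, row) % num_shards].append(row)
    return shards


def collapse(shard):
    """Keep only the highest version per key within a single shard."""
    latest = {}
    for key, version, value in shard:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value)
    return [(key, version, value) for key, (version, value) in latest.items()]


def surviving_rows(shards):
    """Rows left after every shard has collapsed its own data."""
    return sorted(row for shard in shards for row in collapse(shard))


rows = [("u1", 1, "old"), ("u1", 2, "new"), ("u2", 1, "x"), ("u2", 2, "y")]

# Sharding by a hash of the key: all versions of a key share one shard.
by_key = place(rows, 2, lambda i, row: zlib.crc32(row[0].encode()))

# Sharding unrelated to the key: versions of a key end up on different shards.
scattered = place(rows, 2, lambda i, row: i)

print(surviving_rows(by_key))
print(surviving_rows(scattered))
```

With key-based placement only the latest version of each key survives. With scattered placement the stale versions remain, because no shard ever sees both versions of the same key.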


2 Comments

I just thought that AggregatingMergeTree uses the same strategy as ReplacingMergeTree when deduplicating/aggregating. So it is good that I can use AggregatingMergeTree with a rand() sharding key and don't need any specific distribution of data across the nodes. Thank you.
You're right, the logic is similar. You could probably use a clever sharding key for that table as well. Without any details of what you've done and what your data looks like, it's hard to say what the best route is.