
2 million customers produce 2 billion orders in 2 years, but according to historical data, 50% of the orders belong to the top 100 customers, so the data is extremely imbalanced on customer id. Customers want to search their own order history by pay account (a customer has more than one pay account; some customers have thousands of pay accounts).

The difficulty is that the top customers are always changing, and you don't know which customer will have a huge number of orders in a future month.

I need to store 3 or 4 years of order data and provide an order search service for customers. How should I partition the database and tables? I mean, if I hash on the customer's pay account, some databases will have huge amounts of data. I currently have 35 servers with 600 GB of storage on each server.

  • Offtopic: with that amount of orders, consider hiring a database administrator (DBA). Ontopic: MySQL should be able to handle billions of records just fine without partitions when indexed correctly and the schema is designed properly. Commented Jul 25, 2019 at 13:49

1 Answer


The key to the solution: have customer_id as the first column of the PRIMARY KEY in all(?) tables that have that column. And, of course, include AND customer_id = 123 in the queries.

I don't understand "pay accounts", but if there is an acct_id, then probably you need

PRIMARY KEY(customer_id, acct_id, ...)

Since you probably already have id .. AUTO_INCREMENT PRIMARY KEY, change to

PRIMARY KEY(customer_id, acct_id, id)
INDEX(id)     -- sufficient to keep AUTO_INCREMENT happy
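
As a concrete illustration, here is a minimal sketch of the resulting table. The column names order_date and amount are assumptions added for illustration; substitute your actual schema:

CREATE TABLE orders (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    customer_id INT UNSIGNED NOT NULL,
    acct_id     INT UNSIGNED NOT NULL,       -- one of the customer's "pay accounts"
    order_date  DATE NOT NULL,
    amount      DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (customer_id, acct_id, id),  -- clusters each customer's rows together
    INDEX (id)                               -- keeps AUTO_INCREMENT happy
) ENGINE=InnoDB;

-- A typical customer search then reads one contiguous slice of the clustered index:
SELECT * FROM orders WHERE customer_id = 123 AND acct_id = 456;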

The revised PK clusters together the rows that are likely to be used/searched in most queries, thereby speeding them up.

The rows for the 'top' customers will mostly stay in the buffer_pool, thereby decreasing the need for I/O. As one customer gets busier, his rows will bump out a less busy customer. This is the nature of LRU caches. That is, the shifting of "who's on top" is mostly covered automatically.

A "hash" is not likely to help. In fact it may hurt, since it is very random and likely to lead to jumping around in the cache. (More later.)

Will you purge "old" data (after 4 years)? If so, this brings in another issue: DELETEing lots of rows from a huge table is costly. Related to this is the question of what rows are typically fetched -- Perhaps only "recent" rows?

If you need to purge, then PARTITION BY RANGE(TO_DAYS(...)) will greatly speed up the DELETE (by turning it into DROP PARTITION). And it may have some impact on the question of usually looking only at 'recent' rows.

See this for discussion of time-series data. I would recommend arranging for the TO_DAYS() to land on month boundaries, thereby having about 50 partitions.
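
A minimal sketch of what that could look like, continuing the hypothetical orders table above. Note that MySQL requires the partitioning column to appear in every unique key, so the PRIMARY KEY must first be extended to include order_date; the partition names and dates are illustrative:

ALTER TABLE orders
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (customer_id, acct_id, id, order_date);

ALTER TABLE orders
    PARTITION BY RANGE (TO_DAYS(order_date)) (
        PARTITION p201907 VALUES LESS THAN (TO_DAYS('2019-08-01')),
        PARTITION p201908 VALUES LESS THAN (TO_DAYS('2019-09-01')),
        -- ... one partition per month, roughly 50 in total ...
        PARTITION pFuture VALUES LESS THAN MAXVALUE
    );

-- Purging the oldest month becomes a near-instant metadata operation:
ALTER TABLE orders DROP PARTITION p201907;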

With partitioning, I would still design the PK as discussed above. However, it would help to have AND date > ... in most WHERE clauses, else all 50 partitions will be searched, and this will be a performance burden. (It is hard to say even whether it is a big enough burden to fight against having 50 partitions.)
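
For example (dates illustrative), a query like the one below is pruned to just the partitions covering the requested range. MySQL 5.7+ shows which partitions are actually read in the partitions column of EXPLAIN (older versions: EXPLAIN PARTITIONS):

SELECT *
  FROM orders
 WHERE customer_id = 123
   AND order_date >= '2019-05-01';

EXPLAIN SELECT * FROM orders
 WHERE customer_id = 123 AND order_date >= '2019-05-01';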

You now have 35 servers and 600G storage on each server: Are you talking about sharding or replication? If Replication, do you mean one Master and 34 readonly Slaves?

If sharding by customer_id:

  • Build a robust script to move a Customer from one shard to another. This will be your key to many admin tasks: offloading an overloaded shard; upgrading hardware/software; adding another shard; etc.
  • That requires not just a "hash" but a dictionary for looking up where a given customer_id is. I like a combination: hash to, say, 12 bits (0..4095), then look up which of the 35 (as of today) shards the given customer is on. A sketch of such a dictionary follows below.
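
A minimal sketch of that dictionary, kept in a small routing database the application consults. The table name, the CRC32-based hash, and the bucket count are assumptions for illustration:

-- One row per hash bucket; 4096 buckets spread across the (currently 35) shards.
CREATE TABLE bucket_to_shard (
    bucket   SMALLINT UNSIGNED NOT NULL,   -- 0..4095
    shard_id TINYINT UNSIGNED NOT NULL,    -- which shard holds this bucket's customers
    PRIMARY KEY (bucket)
) ENGINE=InnoDB;

-- Route a customer: hash customer_id down to 12 bits, then look up the shard.
SELECT shard_id
  FROM bucket_to_shard
 WHERE bucket = CRC32(123) & 4095;         -- 123 = the customer_id being routed

Rebalancing an overloaded shard is then a matter of copying one bucket's customers to another shard and updating that bucket's single dictionary row.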