
Let's say I have records with 4 identifier variables (var1, var2, var3, var4) and an additional variable, var5.

I want to perform a reduce operation on all the records that share a value in any one of the identifier fields, and I'm trying to figure out how I can implement this with a minimal amount of shuffling.
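For a concrete (hypothetical) picture, with made-up values: records a and b below belong together because they share var2, and b and c because they share var4, so all three end up in one group transitively:

    // Hypothetical records: (var1, var2, var3, var4, var5)
    val a = ("a1", "p", "a3", "a4", 1.0)
    val b = ("b1", "p", "b3", "q", 2.0) // shares var2 with a, var4 with c
    val c = ("c1", "c2", "c3", "q", 3.0)
    // a ~ b and b ~ c, so {a, b, c} should be reduced together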

Is there a way to tell Spark to put all the records that have at least one match in the identifier variables on the same partition? I know there is the option of a custom partitioner, but I'm not sure whether it can be implemented in a way that supports my use case.

2 Answers


Well, quite a lot depends on the structure of your data in general and how much a priori knowledge you have.

In the worst-case scenario, when your data is relatively dense and uniformly distributed, as below, and you perform a one-off analysis, the only way to achieve your goal seems to be to put everything into a single partition.

[1 0 1 0]
[0 1 0 1]
[1 0 0 1] 

Obviously that is not a very useful approach. One thing you can try instead is to analyze at least a subset of your data to get insight into its structure, and use that knowledge to build a custom partitioner that keeps traffic relatively low while still giving a reasonable distribution over the cluster.
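For example, if such an analysis showed that a single identifier (say var1; that's purely an assumption here) drives most of the matches, a minimal custom Partitioner keyed on it could look like this sketch, with the remaining identifiers handled by a smaller follow-up shuffle:

    import org.apache.spark.Partitioner

    // A minimal sketch, assuming prior analysis showed var1 dominates the matches.
    class IdPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        val raw = key.hashCode % partitions
        if (raw < 0) raw + partitions else raw // keep the result non-negative
      }
    }

    // Usage: rdd.map(r => (r._1, r)).partitionBy(new IdPartitioner(16))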

As a general framework, I would try one of these:

Hashing

  • choose some number of buckets
  • for each row create a binary vector of length equal to the number of buckets
    • for each feature in the row
      • hash the feature to a bucket
        • if the value at that bucket is 0, flip it to 1
        • otherwise do nothing
  • sum the computed vectors to get summary statistics
  • use an optimization technique of your choice to create a function from the hash to the partition (sketched below)
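A rough sketch of the bucket-occupancy and summing steps; numBuckets and the tuple layout of rdd are arbitrary assumptions:

    // Sketch only: rdd is assumed to be an RDD[(String, String, String, String, Double)].
    val numBuckets = 64

    def occupancy(features: Seq[String]): Array[Int] = {
      val vec = Array.fill(numBuckets)(0)
      for (f <- features) {
        val b = ((f.hashCode % numBuckets) + numBuckets) % numBuckets
        vec(b) = 1 // flip the bucket to 1 the first time any feature hashes to it
      }
      vec
    }

    // The element-wise sum of the vectors gives per-bucket frequencies: the
    // summary statistics the hash-to-partition function is then fit against.
    val stats = rdd
      .map(r => occupancy(Seq(r._1, r._2, r._3, r._4)))
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })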

Frequent itemsets

  • use one of the algorithms like Apriori, closed patterns, max patterns, or FP-growth on a sample of your data (a sketch follows after this list)
  • compute the distribution of the itemsets over the sample
  • use optimization to compute the hash-to-partition function as above
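Spark's MLlib ships an FP-growth implementation, so the sampling step might look roughly like this sketch (the sample fraction and minSupport are arbitrary placeholders, and rdd is again assumed to hold 5-tuples):

    import org.apache.spark.mllib.fpm.FPGrowth

    // FPGrowth requires the items within a transaction to be unique, hence .distinct.
    val transactions = rdd
      .sample(withReplacement = false, fraction = 0.01)
      .map(r => Array(r._1, r._2, r._3, r._4).distinct)

    val model = new FPGrowth()
      .setMinSupport(0.05)
      .run(transactions)

    // Itemset frequencies over the sample feed the optimization step above.
    model.freqItemsets.collect().foreach { is =>
      println(s"${is.items.mkString("[", ",", "]")}: ${is.freq}")
    }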

Both solutions are computationally intensive and require quite a lot of work to implement, so it is probably not worth the fuss for ad hoc analytics; but if you have a reusable pipeline, it may be worth trying.


This is not generally possible. Imagine you have X with keys (x, x, x, x) and Y with keys (y, y, y, y). There is no reason to put them in the same partition, right? But now comes Z with keys (x, x, y, y). Z has to be in the same partition as X and also in the same partition as Y, even though X and Y sit in different partitions. It is not possible.

I suggest just taking the shuffle. Create 4 RDDs, each partitioned by a different key.
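In code, that might look like this sketch (records stands in for an RDD of 5-tuples, and the reduce shown, summing var5, is just a placeholder):

    import org.apache.spark.HashPartitioner

    // Sketch: records is assumed to be an RDD[(String, String, String, String, Double)].
    val p = new HashPartitioner(16) // partition count is an arbitrary choice

    val byVar1 = records.map(r => (r._1, r)).partitionBy(p)
    val byVar2 = records.map(r => (r._2, r)).partitionBy(p)
    val byVar3 = records.map(r => (r._3, r)).partitionBy(p)
    val byVar4 = records.map(r => (r._4, r)).partitionBy(p)

    // Each keyed RDD can then be reduced by its own identifier, e.g. summing var5:
    val reducedByVar1 = byVar1.mapValues(_._5).reduceByKey(_ + _)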

1 Comment

You are absolutely right. Unfortunately, I have more than 4 identifiers in the real-world scenario, but I will concentrate on performing the shuffles in the right order; hopefully that will give me reasonable performance. Thank you.
