Well, quite a lot depends on the structure of your data in general and on how much a priori knowledge you have.
In the worst-case scenario, when your data is relatively dense and uniformly distributed, like below, and you perform a one-off analysis, the only way to achieve your goal seems to be to put everything into one partition.
[1 0 1 0]
[0 1 0 1]
[1 0 0 1]
Obviously this is not a very useful approach. One thing you can try instead is to analyze at least a subset of your data to get insight into its structure, and use this knowledge to build a custom partitioner that ensures relatively low traffic and a reasonable distribution over the cluster at the same time.
As a general framework, I would try one of these:
Hashing
- choose some number of buckets
- for each row, create a binary vector with length equal to the number of buckets
- for each feature in the row:
  - hash the feature to a bucket
  - if the bit at that bucket is 0, flip it to 1
  - otherwise do nothing
- sum the computed vectors to get summary statistics
- use an optimization technique of your choice to create a function from the hash to a partition (see the sketch below)
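A minimal sketch of the signature computation in plain Python, assuming each row is available as a list of its non-zero features (here `(column, value)` pairs); `bucket`, `signature`, and the toy rows are just illustrative names, and any stable hash would do:

```python
import hashlib

def bucket(feature, n_buckets):
    # Stable hash (unlike the built-in hash(), md5 does not depend on PYTHONHASHSEED).
    digest = hashlib.md5(str(feature).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def signature(row, n_buckets):
    # Binary vector marking which buckets are hit by the row's features.
    bits = [0] * n_buckets
    for feature in row:
        bits[bucket(feature, n_buckets)] = 1
    return tuple(bits)

# Toy rows mirroring the dense example above, kept as non-zero (column, value) pairs.
rows = [
    [(0, 1), (2, 1)],
    [(1, 1), (3, 1)],
    [(0, 1), (3, 1)],
]

n_buckets = 4
signatures = [signature(row, n_buckets) for row in rows]

# Summary statistics: how often each bucket is hit across the sample.
bucket_counts = [sum(column) for column in zip(*signatures)]
```

Whatever mapping the optimization step produces (say a plain dict `signature_to_partition`) then becomes the partitioning function. For example, if this happens to run on Spark, something along the lines of `keyed_rdd.partitionBy(num_partitions, lambda sig: signature_to_partition.get(sig, hash(sig) % num_partitions))` would do, where `keyed_rdd` is an RDD keyed by the signature; all of these names are hypothetical.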
Frequent itemsets
- run one of the frequent pattern mining algorithms (a-priori, FP-growth, closed patterns, max patterns) on a sample of your data
- compute the distribution of the itemsets over the sample
- use optimization to compute a hash-to-partition function as above
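For a first look at the itemset distribution on a small sample, a naive enumeration is enough before committing to a full a-priori / FP-growth run; the snippet below is only such a sketch (`frequent_itemsets`, `max_size`, and `min_support` are illustrative names, not a stand-in for the real algorithms):

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(sample, max_size=2, min_support=0.5):
    # Naive enumeration of itemsets up to max_size; fine for a small sample,
    # but a-priori or FP-growth should be used for anything larger.
    counts = Counter()
    for row in sample:
        items = sorted(set(row))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    threshold = min_support * len(sample)
    return {itemset: n for itemset, n in counts.items() if n >= threshold}

# Example: rows reduced to the set of columns where the value is non-zero.
sample = [{0, 2}, {1, 3}, {0, 3}]
print(frequent_itemsets(sample))
```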
Both solutions are computationally intensive and require quite a lot of work to implement, so it is probably not worth all the fuss for ad hoc analytics, but if you have a reusable pipeline it may be worth trying.