Efficient algorithm or method for shuffling a list with many duplicates?

Question

Let's say I have 5 marbles, 3 are red and 2 are blue. The marbles of the same color are numbered from 1 to n, where n is the number of marbles of that color.

What I need is way to shuffle the list ['red', 'red', 'red','blue', 'blue'] , where the permutations of the individual marbles does not matter.

E.g ['red1', 'red2', 'red3','blue', 'blue'] and ['red2', 'red1', 'red3','blue', 'blue'] are only one solution.

Because my original problem has a lot more marbles of one color, compared to the others, the normal shuffle of the list results in a high bias for certain orders.

One idea I had would be:

Start with a new empty list,
Generate 2 random integers (the first integer to choose a color, the second integer to determine how many marbles to pick of that color)
Append the picked marbles to the new list
Repeat until all marbles have been used up

I'm wondering if there are some better ways to solve this, maybe with a more guaranteed unbiased result.

Edit: A small scaled scenario would be 4 different colors with 2 to 5 marbles per color. A realistic scenario would be 20 diff. colors, with between 5 and 50 marbles per color.

Because this is a performance question, how many possible values are there, and how big is the list? — Michael Bianconi
– Michael Bianconi, Commented Feb 11, 2020 at 15:20
Are the colors supposed to not change position? You only want to shuffle by same colors? — Jab
– Jab, Commented Feb 11, 2020 at 15:34
"the normal shuffle of the list results in a high bias for certain ordes." What do you mean by this? A standard shuffling algorithm like Fisher-Yates (which random.shuffle implements) will give unbiased shuffles regardless of how many duplicates there are. — kaya3
– kaya3, Commented Feb 11, 2020 at 15:38
@kaya3 : If i want to shuffle ['red', 'red', 'blue] Fisher-Yates will handle it like ['red1','red2','blue'], if I understand the algorithm correctly. Which adds solutions that look exactly the same, but the distribution will change towards those results — Dominik
– Dominik, Commented Feb 11, 2020 at 15:50
It adds solutions which "look the same" in equal proportion to all solutions. For your example, Fisher-Yates will generate (red,red,blue), (red,blue,red) and (blue,red,red) with equal probability. How could it be any less biased? — kaya3
– kaya3, Commented Feb 11, 2020 at 16:30

Pete Kirkham · Accepted Answer · 2020-02-11 15:49:45Z

1

If you want an unbiased shuffle, then you are comparing your algorithm to Knuth's (slightly improved version of Fisher-Yate's) shuffle

If you are going to optimise using a two stage selection and want it to remain fair, you need to calculate the probabilities of getting runs of one, two, three of the same colour, and implement this calculation more efficiently than the corresponding number of swaps, and then do that number of copies from your sources into the target list.

I wouldn't bother for performance sake, as it is unlikely with 1000 marbles that you will improve the performance by doing the maths over the swap - you're going to have to do another list iteration on each stage to calculate the probabilities and then extra work which is going to make it O(N²) rather than O(N).

answered Feb 11, 2020 at 15:49

Pete Kirkham

49.5k5 gold badges96 silver badges176 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Alain T. · Accepted Answer · 2020-02-15 03:37:50Z

This will give a uniform distribution of the uniques combinations:

from random import randrange
from collections import Counter
def marblePick(marbles):
    remaining = list(Counter(marbles).elements())
    return [remaining.pop(randrange(r)) for r in range(len(remaining),0,-1)]

The performance seems to be linear and I tested the distribution by sampling as follows:

marbles = {"R":3,"B":2}

N = 1000000
dist = Counter()
for _ in range(N):
    combo = marblePick(marbles)
    dist[tuple(combo)]+=1

for s,c in dist.items():
    print(s,c/N)

import statistics as stats
prop = [c/N for c in dist.values()]
mean = stats.mean(prop)
stdev = stats.stdev(prop)
print("mean:",mean,"stdDev:",f"{100*stdev/mean:.1f}%",N,"samples",len(dist),"combos")

the results are conclusive for this sample (and a few others I tried):

('B', 'R', 'B', 'R', 'R') 0.100245
('R', 'R', 'R', 'B', 'B') 0.099789
('R', 'R', 'B', 'B', 'R') 0.099952
('B', 'R', 'R', 'B', 'R') 0.100579
('R', 'R', 'B', 'R', 'B') 0.099732
('R', 'B', 'R', 'R', 'B') 0.100033
('B', 'B', 'R', 'R', 'R') 0.099861
('B', 'R', 'R', 'R', 'B') 0.099941
('R', 'B', 'R', 'B', 'R') 0.100103
('R', 'B', 'B', 'R', 'R') 0.099765
mean: 0.1 stdDev: 0.3% 1000000 samples 10 combos

Collectives™ on Stack Overflow

Efficient algorithm or method for shuffling a list with many duplicates?

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related