
I am using numpy.random.choice to generate many random samples at once, but I'd like all the samples to be different. Is anybody aware of a function that does this? Explicitly, I'd like the following to hold:

import numpy as np
a = np.random.choice(62, size=(1000000, 8))
assert( len(set([tuple(a[i]) for i in range(a.shape[0])])) == a.shape[0])

The integer values within a row can repeat (sampling with replacement is fine); the only requirement is that all rows are different.

  • Use replace=False. See the documentation. Commented Nov 17, 2021 at 12:29
  • Sorry, we can sample with replacement; the only requirement is for the rows to all be different. I'll make this explicit. Commented Nov 17, 2021 at 12:31
  • If you can replace (numbers), then the entries are not all different. Commented Nov 17, 2021 at 12:33
  • You can also generate an array (with arange and tile, perhaps), and then shuffle it. Shuffle only works along the first axis of a multidimensional array, which may be what you need. Commented Nov 17, 2021 at 12:35
  • I think the intent here is that one row, not one integer, is a sample. All the rows must be distinct, but individual numbers are expected to repeat many times. Commented Nov 17, 2021 at 12:40

1 Answer


This answer has been streamlined. Obsolete approaches can be found in the edit history.


First things first: if you have NumPy >= 1.17, avoid np.random.choice in favour of the recommended Generator method:

rng = np.random.default_rng()
a = rng.choice(62, size=(1000000, 8))  # Generator.choice, the preferred replacement for np.random.choice

Each sample has 8 values, and for max_value = 62 there are 62**8 possible unique samples. Drawing just 1 million of them means that, by the birthday problem, all of them will be distinct in a single draw about 99.8% of the time (see the sanity check after the code below). In this case it suffices to generate the whole array and do a simple check.

samples = 1000000
while True:
    a = np.random.choice(62, size=(samples, 8))
    # Credit to Mark Dickinson, this is faster than doing
    # `len(set(tuple(row) for row in a)) == samples`
    if np.unique(a, axis=0).shape[0] == samples:
        break
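
As a rough sanity check on that 99.8% figure (my own sketch, not part of the original answer), the standard birthday-problem approximation gives:

import math

n = 1_000_000          # rows drawn
N = 62 ** 8            # possible distinct rows
# P(all n rows are distinct) ≈ exp(-n * (n - 1) / (2 * N))
print(math.exp(-n * (n - 1) / (2 * N)))  # ≈ 0.9977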

For lower values of max_value (less than 30, say) you may generate duplicates frequently enough that the approach above becomes inefficient, or even an infinite loop if you ask for more samples than actually exist. It is then better to generate the whole array, keep any unique samples in a set, and generate however many more you still require. Iterate this process until you have as many as you need.

seen = set()
a = []
while len(a) < samples:
    # Only draw as many rows as are still missing.
    draws = np.random.choice(62, size=(samples - len(a), 8))
    for draw in draws:
        if (t := tuple(draw)) not in seen:
            seen.add(t)
            a.append(draw)
a = np.array(a)

This assumes the number of samples you want to draw is much smaller than the total number of unique samples. If, for example, the total were 1001 samples and you wanted to draw 1000, this approach would quickly become inefficient.
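
In that near-exhaustive regime it is usually better to enumerate every possible row and then sample row indices without replacement. A minimal sketch with deliberately tiny, hypothetical parameters (my own illustration, not from the answer above):

import itertools
import numpy as np

# Hypothetical small case: 3 possible values, rows of length 4 -> only 81 possible rows.
max_value, row_len, wanted = 3, 4, 80

rng = np.random.default_rng()
all_rows = np.array(list(itertools.product(range(max_value), repeat=row_len)))
idx = rng.choice(len(all_rows), size=wanted, replace=False)  # distinct indices -> distinct rows
a = all_rows[idx]
assert len(set(map(tuple, a))) == wanted

This only works when max_value**row_len is small enough to enumerate, which is exactly the case where the rejection approach struggles.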


6 Comments

Hello @Reti43, I ended up doing something like your last chunk of code. Indeed, the first case will almost certainly not need additional checks, but I have other cases and I can sacrifice a little speed. Thank you.
For your first approach, you could stay in NumPy land (and probably gain some speed) by using np.unique: that is, do a = np.unique(np.random.choice(62, size=(samples, 8)), axis=0) and then repeat until a has the right shape. It does return a sorted result, so if that's not what the OP wants an extra shuffle afterwards might be needed.
There doesn't seem to be a whole lot in it speed-wise: on my machine, after generating a sample a with a = np.random.choice(62, size=(10**6, 8)), np.unique(a, axis=0).shape[0] takes around 1.1s, while len(set(tuple(row) for row in a)) takes around 1.4s.
@MarkDickinson This is interesting. np.unique also takes 1.1 sec on my machine, but the set comprehension 2 sec. If you are in a situation where you expect to very likely not have duplicates, this is a faster test. However, I struggle to see how it could be incorporated in a performant way if you have to reroll some samples. Rerolling the duplicates rows and concatenating to the already unique samples is about as fast as rerolling everything again. But if a shuffle is required, that is going to hurt.
@JuanChô I've edited the answer to improve the code performance. The set method can be improved by not drawing the samples one by one, but as many as you need in one go. This effectively makes it very similar to (and actually a bit faster than) the defaultlist method, so I've removed the latter.
