
I want to generate random numbers and store them in a list as the following:

alist = [random.randint(0, 2 ** mypower - 1) for _ in range(total)]

My concern is the following: I want to generate total = 40 million values in the range (0, 2 ** mypower - 1). If mypower = 64, then alist will be ~20GB (40M*64*8) in size, which is far too large for my laptop's memory. My idea is to iteratively generate chunks of values, say 5 million at a time, and save them to a file so that I don't have to generate all 40M values at once. My concern is: if I do that in a loop, is it guaranteed that random.randint(0, 2 ** mypower - 1) will not generate values that were already generated in a previous iteration? Something like this:

for i in range(num_of_chunks):
    alist = [random.randint(0, 2 ** mypower - 1) for _ in range(chunk)]
    # save alist to file
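A quick small-scale experiment (not from the question; the range is shrunk here so collisions become easy to observe) shows how such chunks can overlap, since each randint call is independent of earlier ones:

```python
import random

random.seed(0)  # fixed seed only to make this snippet reproducible
# shrink the range so collisions are easy to observe
chunk1 = {random.randint(0, 99) for _ in range(50)}
chunk2 = {random.randint(0, 99) for _ in range(50)}
print(len(chunk1 & chunk2))  # > 0: values repeat across chunks
```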
  • no it is not guaranteed at all! Commented Jul 20, 2017 at 20:44
  • This will not even be the case in the single-chunk situation. Commented Jul 20, 2017 at 20:46
  • random.randint returns values randomly. random.randint(x, y) has the same chance of returning any number in the range x, y each time you call it. You could calculate the chance you would get repeats, which I think would be pretty low considering the size you are sampling from, but if you want to completely avoid repeats, you would need to use something else. Commented Jul 20, 2017 at 20:49
  • Why "*68*8"? I'm quite certain you're very much overestimating this. Commented Jul 20, 2017 at 20:50
  • How in the world did you get that number? A list will take 8 bytes per pointer (on 64-bit CPython), so the list itself will take a shy bit over 300MB, and the integers themselves will take 24 bytes a pop, so an additional 915MB give or take - or ~1.2GB of memory for such a list. And this is pure Python; you can use much more optimized and streamlined libraries to cram more data into that memory. Commented Jul 20, 2017 at 20:55
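The per-element figures in this comment can be checked directly (a quick illustrative snippet, not from the thread; exact sizes vary slightly by CPython version):

```python
import random
import sys

n = 100_000  # smaller than 40M, but the per-element costs are the same
alist = [random.randint(0, 2 ** 64 - 1) for _ in range(n)]

per_pointer = sys.getsizeof(alist) / n          # list slots: ~8-9 bytes each
per_int = sum(map(sys.getsizeof, alist)) / n    # int objects: roughly 28-36 bytes each
print(per_pointer, per_int)
```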

5 Answers


Well, since efficiency/speed doesn't matter, I think this will work:

import random

s = set()
while len(s) < total:
    s.add(random.randint(0, 2 ** mypower - 1))
alist = list(s)

Since sets can only contain unique elements, I think this will work well enough.
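To combine this with the chunked-file idea from the question, one possible variation (a sketch, not part of the original answer; note the set still keeps every value in memory, so this only avoids holding the full list as well) is:

```python
import random

mypower = 64
total = 1_000   # the question uses 40 million; shrunk here for illustration
chunk = 250

seen = set()
with open("random_values.txt", "w") as f:
    while len(seen) < total:
        batch = []
        # fill one chunk with values not seen in any earlier chunk
        while len(batch) < chunk and len(seen) < total:
            n = random.randint(0, 2 ** mypower - 1)
            if n not in seen:
                seen.add(n)
                batch.append(n)
        f.write("\n".join(map(str, batch)) + "\n")
```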


7 Comments

You're welcome. I'm sure there are better ways of doing it but I can't think of any.
using a set is nice, but the more numbers you insert, the slower it gets, because you cannot know how many iterations you'll have to perform to reach the total. That said, random doesn't repeat much, so that's a good compromise.
Yeah, I'm sure there are more efficient ways of doing it; I posted this method because the OP said it doesn't matter how long it takes to generate the list
@Jean-FrançoisFabre Likely it needs exactly 40 million iterations. Because likely it won't create any duplicates anyway.
@Jean-FrançoisFabre Ok I calculated it. The probability that no duplicates are generated anyway is about 99.9956%.
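For reference, that probability follows from the standard birthday-problem approximation (a quick check, not part of the original comment):

```python
import math

n = 40_000_000   # numbers drawn
N = 2 ** 64      # size of the range
# P(all n values distinct) ≈ exp(-n*(n-1) / (2*N))
p = math.exp(-n * (n - 1) / (2 * N))
print(f"{p:.6%}")  # roughly 99.9957%
```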

To guarantee unique values you should avoid using random. Instead you should use encryption. Because encryption is reversible, unique inputs guarantee unique outputs, given the same key. Encrypt the numbers 0, 1, 2, 3, ... and you will get guaranteed-unique, random-seeming outputs back, provided you use a secure cipher. Good encryption is designed to give random-seeming output.

Keep track of the key (essential) and how far you have got. For your first batch, encrypt the integers 0..5,000,000. For the second batch, encrypt 5,000,001..10,000,000, and so on.

You want 64-bit numbers, so use DES in ECB mode. DES is a 64-bit cipher, so the output from each encryption will be 64 bits. ECB mode does have a weakness, but it only applies with identical inputs. You are supplying unique inputs, so the weakness is not relevant for your particular application.

If you need to regenerate the same numbers, just re-encrypt them with the same key. If you need a different set of random numbers (which will duplicate some from the first set) then use a different key. The guarantee of uniqueness only applies with a fixed key.
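As a minimal sketch of the idea — a block cipher is a bijection, so encrypting a counter yields guaranteed-unique outputs — here is a toy 4-round Feistel network over 64 bits standing in for DES (it is not cryptographically secure, and the round keys are arbitrary made-up values, but the uniqueness argument is identical):

```python
import hashlib

MASK32 = 0xFFFFFFFF

def round_fn(half, key):
    # any deterministic 32-bit -> 32-bit mixing function works in a Feistel round
    data = half.to_bytes(4, "big") + key.to_bytes(4, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def encrypt64(n, keys):
    # a Feistel network is invertible, hence a permutation of 0..2**64-1
    left, right = n >> 32, n & MASK32
    for k in keys:
        left, right = right, left ^ round_fn(right, k)
    return (left << 32) | right

keys = [0xA5A5A5A5, 0x3C3C3C3C, 0x0F0F0F0F, 0x5A5A5A5A]  # fixed, arbitrary "key"
batch = [encrypt64(i, keys) for i in range(10_000)]       # "encrypt the counter"
assert len(set(batch)) == len(batch)                      # all distinct, by construction
```

With a real library you would use actual DES (e.g. from pycryptodome) in place of the toy cipher; the structure — a fixed key, a counter as plaintext — stays the same.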

9 Comments

Why would you still do it in batches with this?
The question talks about batching the numbers to help avoid overly large chunks of data. This method avoids having to cross-check between batches for duplicates: unique 'random' numbers can be generated as required, up to 2^64 - 1.
I disagree with the general-sounding "you should avoid using random" (after all, random.sample does exactly this functionality, and it does use random). But I'll upvote anyway for the creativity and because I think it works in this case (2^64 is rather special).
Random alone cannot "guarantee unique". Indeed any good RNG should be able to return the same number twice on the trot. That is why encryption, or shuffling, is often better when uniqueness is a requirement. There are ways round the 2^64 (or 2^128 for AES) boundary: use a Format Preserving Encryption to keep the numbers within the desired range.
Which is why random.sample doesn't use "random alone". Can you point out a Format Preserving Encryption algorithm which would work well here if the range were let's say [0, 10^19) instead of [0, 2^64)? (That's about the same size, just not a power of 2.) Preferably a simple one so I can understand it :-)

One way to generate random values that don't repeat is to first create a list of contiguous values

l = list(range(1000))

then shuffle it:

import random
random.shuffle(l)

You could do that several times and save the results to a file, but you'll only cover limited ranges: with limited memory you can never hold the whole picture at once (much like trying to sort a big list without enough memory for it).

As someone noted, to get a wide span of random numbers you'd need a lot of memory, so this is simple but not very efficient.

Another hack I just thought of: do the same as above, but generate the range using a step. Then, in a second pass, add a random offset to each value. Even if the offset values repeat, it is guaranteed to never generate the same number twice, because each base value sits in its own disjoint stride:

import random

step = 10

l = list(range(0,1000-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]

With the required max value and number of iterations, that gives:

import random

number_of_iterations = 40*10**6
max_number = 2**64
step = max_number//number_of_iterations

l = list(range(0,max_number-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]
print(len(newlist),len(set(newlist)))

This runs in 1-2 minutes on my laptop and gives 40000000 distinct values (evenly scattered across the range).

6 Comments

So, he's worried about a list of size 40 million being too big for his PC and your solution involves a list of size 18 quintillion?
I don't know how you computed that memory size, but I agree that even if my solution ensures no number is repeated, you'd need to store every number up to the max limit. Well, that gives me another idea. Edited.
You mean you don't know how I got 18 quintillion? That's 2**64, his population size.
@Jean-FrançoisFabre - he probably computed the size by the fact that using your (first) approach you'd have to generate a list with all 64-bit numbers which would need, at least temporarily (before shuffling and cutting off the 40M slice) - 2^64 indices or ~18 quintillion (or in terms of real memory usage, around 2^64 * 35 bytes, or 645EB of memory).
You're both right. I knew 2**64 was a big number from the story of the chessboard and the rice grains... but I never cared about the actual value. I have edited my answer; it is now feasible without too much memory.

Usually random number generators are not really random at all; in fact, this is quite helpful in some situations. If you want the values to be different on each iteration, give the generator a different seed value.

random.seed()

The same seed will generate the same list, so if you want the next iteration to be the same, use the same seed. If you want it to be different, use a different seed.
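A quick illustration of how seeding behaves (with an important caveat: a different seed gives a different sequence, but it does not prevent individual values from overlapping with the first sequence):

```python
import random

random.seed(42)
first = [random.randint(0, 2 ** 64 - 1) for _ in range(5)]

random.seed(42)                 # same seed -> identical sequence
assert first == [random.randint(0, 2 ** 64 - 1) for _ in range(5)]

random.seed(43)                 # different seed -> a different sequence,
third = [random.randint(0, 2 ** 64 - 1) for _ in range(5)]
assert first != third           # but uniqueness across sequences is NOT guaranteed
```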

Comments


This may need a lot of CPU and physical memory! I suggest you partition your data. For example, you can save: all numbers starting with 10 that are 5 characters long (example: 10365) to 10-5.txt

All numbers starting with 11 that are 6 characters long (example: 114567) to 11-6.txt

Then, to check a new number: for example, my number is 9256547. It starts with 92 and is 7 characters long, so I search 92-7.txt for this number, and if it isn't a duplicate, I add it to 92-7.txt. Finally, you can join all the files together. Sorry if I have made mistakes; my main language isn't English.
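An in-memory sketch of this bucketing scheme (sets stand in for the per-prefix files; the bucket key and counts are illustrative, not from the answer):

```python
import random
from collections import defaultdict

def bucket_key(n):
    s = str(n)
    return (s[:2], len(s))   # e.g. 10365 -> ("10", 5), i.e. file "10-5.txt"

buckets = defaultdict(set)   # one set per would-be file
count = 0
while count < 1_000:
    n = random.randint(10_000, 2 ** 64 - 1)
    b = buckets[bucket_key(n)]
    if n not in b:           # the duplicate check touches only one small bucket
        b.add(n)
        count += 1
```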

Comments
