
I want to generate random numbers and store them in a list as the following:

alist = [random.randint(0, 2 ** mypower - 1) for _ in range(total)]

My concern is the following: I want to generate total = 40 million values in the range (0, 2 ** mypower - 1). If mypower = 64, then alist will be ~20GB (40M*64*8) in size, which is far too large for my laptop's memory. My idea is to iteratively generate chunks of values, say 5 million at a time, and save them to a file so that I don't have to generate all 40M values at once. My concern is: if I do that in a loop, is it guaranteed that random.randint(0, 2 ** mypower - 1) will not generate values that were already generated in a previous iteration? Something like this:

for i in range(num_of_chunks):
    alist = [random.randint(0, 2 ** mypower - 1) for _ in range(chunk)]
    # save alist to file
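A quick small-scale experiment (not from the question; the range is shrunk here so collisions become easy to observe) shows how such chunks can overlap, since each randint call is independent of earlier ones:

```python
import random

random.seed(0)  # fixed seed only to make this snippet reproducible
# shrink the range so collisions are easy to observe
chunk1 = {random.randint(0, 99) for _ in range(50)}
chunk2 = {random.randint(0, 99) for _ in range(50)}
print(len(chunk1 & chunk2))  # > 0: values repeat across chunks
```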
  • no it is not guaranteed at all! Commented Jul 20, 2017 at 20:44
  • This will not even be the case in the single-chunk situation. Commented Jul 20, 2017 at 20:46
  • random.randint returns values randomly. random.randint(x, y) has the same chance of returning any number in the range x, y each time you call it. You could calculate the chance you would get repeats, which I think would be pretty low considering the size you are sampling from, but if you want to completely avoid repeats, you would need to use something else. Commented Jul 20, 2017 at 20:49
  • Why "*68*8"? I'm quite certain you're very much overestimating this. Commented Jul 20, 2017 at 20:50
  • How in the world did you get that number? A list will take 8 bytes per pointer (on 64-bit CPython), so the list itself will take a shy bit over 300MB, and the integers themselves will take 24 bytes a pop, so an additional 915MB give or take - or ~1.2GB of memory for such a list. And this is pure Python; you can use much more optimized and streamlined libraries to cram more data into that memory. Commented Jul 20, 2017 at 20:55
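The per-element figures in this comment can be checked directly (a quick illustrative snippet, not from the thread; exact sizes vary slightly by CPython version):

```python
import random
import sys

n = 100_000  # smaller than 40M, but the per-element costs are the same
alist = [random.randint(0, 2 ** 64 - 1) for _ in range(n)]

per_pointer = sys.getsizeof(alist) / n          # list slots: ~8-9 bytes each
per_int = sum(map(sys.getsizeof, alist)) / n    # int objects: roughly 28-36 bytes each
print(per_pointer, per_int)
```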

5 Answers


Well, since efficiency/speed doesn't matter, I think this will work:

import random

s = set()
while len(s) < total:
    s.add(random.randint(0, 2 ** mypower - 1))
alist = list(s)

Since sets can only contain unique elements, I think this will work well enough.
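To combine this with the chunked-file idea from the question, one possible variation (a sketch, not part of the original answer; note the set still keeps every value in memory, so this only avoids holding the full list as well) is:

```python
import random

mypower = 64
total = 1_000   # the question uses 40 million; shrunk here for illustration
chunk = 250

seen = set()
with open("random_values.txt", "w") as f:
    while len(seen) < total:
        batch = []
        # fill one chunk with values not seen in any earlier chunk
        while len(batch) < chunk and len(seen) < total:
            n = random.randint(0, 2 ** mypower - 1)
            if n not in seen:
                seen.add(n)
                batch.append(n)
        f.write("\n".join(map(str, batch)) + "\n")
```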


7 Comments

You're welcome. I'm sure there are better ways of doing it but I can't think of any.
using a set is nice, but the more numbers you insert, the slower it gets, because you cannot know how many iterations you'll have to perform to reach the total. That said, random doesn't repeat much, so that's a good compromise.
Yeah, I'm sure there are more efficient ways of doing it; I posted this method because the OP said it doesn't matter how long it takes to generate the list
@Jean-FrançoisFabre Likely it needs exactly 40 million iterations. Because likely it won't create any duplicates anyway.
@Jean-FrançoisFabre Ok I calculated it. The probability that no duplicates are generated anyway is about 99.9956%.
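For reference, that probability follows from the standard birthday-problem approximation (a quick check, not part of the original comment):

```python
import math

n = 40_000_000   # numbers drawn
N = 2 ** 64      # size of the range
# P(all n values distinct) ≈ exp(-n*(n-1) / (2*N))
p = math.exp(-n * (n - 1) / (2 * N))
print(f"{p:.6%}")  # roughly 99.9957%
```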

To guarantee unique values you should avoid using random. Instead you should use encryption. Because encryption is reversible, unique inputs guarantee unique outputs, given the same key. Encrypt the numbers 0, 1, 2, 3, ... and you will get guaranteed-unique, random-seeming outputs back, provided you use a secure cipher. Good encryption is designed to give random-seeming output.

Keep track of the key (essential) and how far you have got. For your first batch, encrypt the integers 0..5,000,000. For the second batch, encrypt 5,000,001..10,000,000, and so on.

You want 64-bit numbers, so use DES in ECB mode. DES is a 64-bit cipher, so the output from each encryption will be 64 bits. ECB mode does have a weakness, but it only applies with identical inputs. You are supplying unique inputs, so the weakness is not relevant for your particular application.

If you need to regenerate the same numbers, just re-encrypt them with the same key. If you need a different set of random numbers (which will duplicate some from the first set) then use a different key. The guarantee of uniqueness only applies with a fixed key.
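As a minimal sketch of the idea — a block cipher is a bijection, so encrypting a counter yields guaranteed-unique outputs — here is a toy 4-round Feistel network over 64 bits standing in for DES (it is not cryptographically secure, and the round keys are arbitrary made-up values, but the uniqueness argument is identical):

```python
import hashlib

MASK32 = 0xFFFFFFFF

def round_fn(half, key):
    # any deterministic 32-bit -> 32-bit mixing function works in a Feistel round
    data = half.to_bytes(4, "big") + key.to_bytes(4, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def encrypt64(n, keys):
    # a Feistel network is invertible, hence a permutation of 0..2**64-1
    left, right = n >> 32, n & MASK32
    for k in keys:
        left, right = right, left ^ round_fn(right, k)
    return (left << 32) | right

keys = [0xA5A5A5A5, 0x3C3C3C3C, 0x0F0F0F0F, 0x5A5A5A5A]  # fixed, arbitrary "key"
batch = [encrypt64(i, keys) for i in range(10_000)]       # "encrypt the counter"
assert len(set(batch)) == len(batch)                      # all distinct, by construction
```

With a real library you would use actual DES (e.g. from pycryptodome) in place of the toy cipher; the structure — a fixed key, a counter as plaintext — stays the same.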

9 Comments

Why would you still do it in batches with this?
The question talks about batching the numbers to help avoid overly large chunks of data. This method avoids having to cross-check between batches for duplicates: unique 'random' numbers can be generated as required, up to 2^64 - 1.
I disagree with the general-sounding "you should avoid using random" (after all, random.sample does exactly this functionality, and it does use random). But I'll upvote anyway for the creativity and because I think it works in this case (2^64 is rather special).
Random alone cannot "guarantee unique". Indeed any good RNG should be able to return the same number twice on the trot. That is why encryption, or shuffling, is often better when uniqueness is a requirement. There are ways round the 2^64 (or 2^128 for AES) boundary: use a Format Preserving Encryption to keep the numbers within the desired range.
Which is why random.sample doesn't use "random alone". Can you point out a Format Preserving Encryption algorithm which would work well here if the range were let's say [0, 10^19) instead of [0, 2^64)? (That's about the same size, just not a power of 2.) Preferably a simple one so I can understand it :-)

One way to generate random values that don't repeat is to first create a list of contiguous values

l = list(range(1000))

then shuffle it:

import random
random.shuffle(l)

You could do that several times and save the results to a file, but you'll only cover limited ranges: with limited memory you can never hold the whole picture at once (much like trying to sort a big list without enough memory for it).

As someone noted, to get a wide span of random numbers you'd need a lot of memory, so this is simple but not very efficient.

Another hack I just thought of: do the same as above, but generate the range using a step. Then, in a second pass, add a random offset to each value. Even if the offset values repeat, it is guaranteed to never generate the same number twice, because each base value sits in its own disjoint stride:

import random

step = 10

l = list(range(0,1000-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]

With the required max value and number of iterations, that gives:

import random

number_of_iterations = 40*10**6
max_number = 2**64
step = max_number//number_of_iterations

l = list(range(0,max_number-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]
print(len(newlist),len(set(newlist)))

This runs in 1-2 minutes on my laptop and gives 40000000 distinct values (evenly scattered across the range).

6 Comments

So, he's worried about a list of size 40 million being too big for his PC and your solution involves a list of size 18 quintillion?
I don't know how you computed that memory size, but I agree that even if my solution ensures no number is repeated, you'd need to store every number up to the max limit. Well, that gives me another idea. Edited.
You mean you don't know how I got 18 quintillion? That's 2**64, his population size.
@Jean-FrançoisFabre - he probably computed the size by the fact that using your (first) approach you'd have to generate a list with all 64-bit numbers which would need, at least temporarily (before shuffling and cutting off the 40M slice) - 2^64 indices or ~18 quintillion (or in terms of real memory usage, around 2^64 * 35 bytes, or 645EB of memory).
You're both right. I knew 2**64 was a big number from the story of the chessboard and the rice grains... but I never cared about the actual value. I have edited my answer; it is now feasible without too much memory.

Usually random number generators are not really random at all; in fact, this is quite helpful in some situations. If you want the values to be different on each iteration, give the generator a different seed value.

random.seed()

The same seed will generate the same list, so if you want the next iteration to be the same, use the same seed. If you want it to be different, use a different seed.
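A quick illustration of how seeding behaves (with an important caveat: a different seed gives a different sequence, but it does not prevent individual values from overlapping with the first sequence):

```python
import random

random.seed(42)
first = [random.randint(0, 2 ** 64 - 1) for _ in range(5)]

random.seed(42)                 # same seed -> identical sequence
assert first == [random.randint(0, 2 ** 64 - 1) for _ in range(5)]

random.seed(43)                 # different seed -> a different sequence,
third = [random.randint(0, 2 ** 64 - 1) for _ in range(5)]
assert first != third           # but uniqueness across sequences is NOT guaranteed
```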

Comments


This may need a lot of CPU and physical memory! I suggest you partition your data. For example, you can save: all numbers starting with 10 that are 5 characters long (example: 10365) to 10-5.txt

All numbers starting with 11 that are 6 characters long (example: 114567) to 11-6.txt

Then, to check a new number: for example, my number is 9256547. It starts with 92 and is 7 characters long, so I search 92-7.txt for this number, and if it isn't a duplicate, I add it to 92-7.txt. Finally, you can join all the files together. Sorry if I have made mistakes; my main language isn't English.
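An in-memory sketch of this bucketing scheme (sets stand in for the per-prefix files; the bucket key and counts are illustrative, not from the answer):

```python
import random
from collections import defaultdict

def bucket_key(n):
    s = str(n)
    return (s[:2], len(s))   # e.g. 10365 -> ("10", 5), i.e. file "10-5.txt"

buckets = defaultdict(set)   # one set per would-be file
count = 0
while count < 1_000:
    n = random.randint(10_000, 2 ** 64 - 1)
    b = buckets[bucket_key(n)]
    if n not in b:           # the duplicate check touches only one small bucket
        b.add(n)
        count += 1
```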

Comments
