
I want to generate random strings (or arrays) of 1's and 0's. Then I classify them according to quantity (count) of 1's. I want the generated strings to be evenly distributed among the possible counts.

But the following code gives me a bell-shaped (approximately normal) distribution of counts:

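import numpy as np

# Example values; the original question did not state them.
num_examples = 10000
seq_length = 10
sequences = np.zeros((num_examples, seq_length), dtype='float32')

for i in range(num_examples):
    # Each element is independently 0 or 1 with probability 1/2.
    seq = np.random.randint(2, size=seq_length).astype('float32')
    sequences[i] = seq

target_classes = []
for seq in sequences:
    target = int((seq == 1).sum())  # class = number of 1's in the sequence
    target_classes.append(target)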

The histogram of counts is:

[image: histogram of the counts]

A NumPy solution would be awesome. Or do I need regular expressions or something else?

  • "So I want the generated strings to be evenly distributed in these classes" - why do you expect them to be evenly distributed in these classes? That's like expecting 0 heads to be as likely as 50 heads in a sequence of 100 coin flips. Commented Aug 10, 2017 at 22:37
  • This seems more of a mathematical or statistical problem. Once you know the math that produces the distribution you want, converting it to Python should be straightforward. Commented Aug 10, 2017 at 22:37
  • Strings? I don't see what you mean... Commented Aug 10, 2017 at 22:37
  • OP didn't say he expects them to be evenly distributed, but that he wants them evenly distributed. Commented Aug 10, 2017 at 22:38
  • @juanpa.arrivillaga string or array, not a problem. I can convert one to another. Commented Aug 10, 2017 at 22:40
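
To make the coin-flip comment concrete, here is a minimal sketch, assuming every element is an independent fair bit (which is what np.random.randint(2, ...) produces). Under that assumption the number of 1's follows a binomial distribution, so middle counts are far more likely than extreme ones. The helper name count_probabilities is hypothetical and only used for illustration.

```python
from math import comb


def count_probabilities(seq_length):
    """Probability of each possible number of 1's among seq_length fair bits."""
    total = 2 ** seq_length  # number of equally likely bit strings
    # comb(seq_length, k) bit strings contain exactly k ones.
    return [comb(seq_length, k) / total for k in range(seq_length + 1)]


probs = count_probabilities(10)
for k, p in enumerate(probs):
    print(f"{k:2d} ones: {p:.4f}")
```

For a length of 10, only 1 of the 1024 possible strings has no 1's, while 252 of them have exactly five, which is why the histogram in the question peaks in the middle.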

2 Answers


As @Prune already noted, this is essentially a two-step process. First you create a uniform distribution of the "number of ones" (for example with np.random.randint), then you set that many elements of each sequence to one at randomly chosen positions (for example using np.random.choice).

One possibility would be:

import numpy as np

NUM_EXAMPLES = 10000
SEQ_LENGTH = 10

sequences = np.zeros((NUM_EXAMPLES, SEQ_LENGTH), dtype=np.int8)
# Number of ones in each sequence, drawn uniformly from 0..SEQ_LENGTH
number_of_1s = np.random.randint(0, SEQ_LENGTH+1, size=NUM_EXAMPLES)

indices = np.arange(SEQ_LENGTH)
for idx, num_ones in enumerate(number_of_1s.tolist()):
    # Set "num_ones" elements to 1 using "choice" without replacement.
    sequences[idx][np.random.choice(indices, num_ones, replace=False)] = 1

A histogram of the number of ones per sequence shows that the counts are quite evenly distributed:

import matplotlib.pyplot as plt

plt.hist(np.sum(sequences==1, axis=1), bins=np.arange(SEQ_LENGTH+2)-0.5, histtype='step')

[image: histogram of the number of ones per sequence]
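
If the Python loop becomes a bottleneck, the same two-step idea can probably be written without it. The following is only a sketch of one possible variant, not part of the approach above: it assumes NumPy's newer Generator API (np.random.default_rng) and swaps np.random.choice for an argsort trick. Sorting a row of independent random keys yields a random permutation of the positions, so marking the positions whose rank is below the drawn count places exactly that many ones at random.

```python
import numpy as np

NUM_EXAMPLES = 10000
SEQ_LENGTH = 10

rng = np.random.default_rng()

# Step 1: number of ones per sequence, uniform over 0..SEQ_LENGTH (upper bound is exclusive).
number_of_1s = rng.integers(0, SEQ_LENGTH + 1, size=NUM_EXAMPLES)

# Step 2: argsort of random keys gives each row an independent random
# permutation of 0..SEQ_LENGTH-1.
ranks = rng.random((NUM_EXAMPLES, SEQ_LENGTH)).argsort(axis=1)

# Exactly k entries of a permutation are smaller than k, so each row
# receives exactly number_of_1s[row] ones at random positions.
sequences = (ranks < number_of_1s[:, None]).astype(np.int8)
```

The trade-off is memory: this builds a full NUM_EXAMPLES x SEQ_LENGTH array of random floats, which is fine for sizes like these but may matter for very long sequences.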


If you want equal distribution of the quantity of 1's, then I think you'll find it easiest to first generate the quantity, and then to randomly distribute that many 1's through the binary representation. This is a two-step process, almost by necessity.

With that hint, can you do the coding on your own?

3 Comments

No, one random call to get the quantity, then something to distribute that many 1s in a string of 0s. I don't see a role for regex in this.
I think I don't even need the random call. Just an integer array from 0 to (num_of_seq / len_of_seq). I will try to implement "something" part. Thanks.
Ah ... if you want to guarantee equal distribution, yes -- you do not need the first random call. I was thinking of a uniform random distribution.
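
The variant discussed in these comments, where every class gets exactly the same number of sequences, might look like the sketch below. It is an illustration under stated assumptions rather than code from the thread: NUM_EXAMPLES is an assumed value chosen as a multiple of SEQ_LENGTH + 1 so the classes can be exactly equal, and the random call for the counts is replaced by a repeating 0..SEQ_LENGTH pattern that is then shuffled.

```python
import numpy as np

SEQ_LENGTH = 10
NUM_EXAMPLES = 11000  # assumed value: a multiple of SEQ_LENGTH + 1

rng = np.random.default_rng()

# Cycle through the counts 0..SEQ_LENGTH so each appears equally often,
# then shuffle so the classes are not in a fixed order.
number_of_1s = np.arange(NUM_EXAMPLES) % (SEQ_LENGTH + 1)
rng.shuffle(number_of_1s)

sequences = np.zeros((NUM_EXAMPLES, SEQ_LENGTH), dtype=np.int8)
for idx, num_ones in enumerate(number_of_1s.tolist()):
    # Randomness is still needed to decide where the ones go.
    positions = rng.choice(SEQ_LENGTH, size=num_ones, replace=False)
    sequences[idx, positions] = 1

# Number of sequences in each class (class = count of ones).
class_sizes = np.bincount(sequences.sum(axis=1), minlength=SEQ_LENGTH + 1)
print(class_sizes)
```

With these values every class ends up with 1000 sequences. If NUM_EXAMPLES is not a multiple of SEQ_LENGTH + 1, the class sizes differ by at most one.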
