What is the best way to create a NumPy array x of a given size with values randomly (and uniformly?) spread between -1 and 1, that also sum to 1?

I tried 2*np.random.rand(size)-1 and np.random.uniform(-1,1,size) based on the discussion here. If I take a transformation approach and re-scale either result by its sum afterwards, x /= np.sum(x), the elements do sum to 1, but some elements suddenly become much greater than 1 or less than -1, which is not wanted.
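For example, a minimal reproduction of the problem (size 10 is arbitrary):

```python
import numpy as np

x = np.random.uniform(-1, 1, 10)
s = x.sum()              # typically not 1; often |s| < 1
x /= s                   # now the elements sum to 1...
print(x.min(), x.max())  # ...but whenever |s| < 1 the values get stretched outside [-1, 1]
```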

  • Any distribution symmetric around zero will have a sum that is either zero or very close to it. Dividing by zero is invalid, and dividing numbers by small fractions makes them bigger. Can you provide more context on what you're actually trying to achieve and why it's necessary? Commented Sep 14, 2020 at 21:11
  • @PaulH It will have a mean close to zero. I don't see why it would have to have a sum close to zero. Commented Sep 14, 2020 at 21:14
  • generate array x whose elements are each between -1 and 1, and whose elements sum to 1 Commented Sep 14, 2020 at 21:15
  • What you're asking for makes no mathematical sense. "randomly and uniformly spread between -1 and 1" completely determines the distribution; you can't attach another condition on top of that. Commented Sep 14, 2020 at 21:15
  • @PaulH I get that, but take np.random.uniform(-1, 1, 1000).sum() and you should not be surprised to see values around 50 or so. Commented Sep 14, 2020 at 21:20

4 Answers


In this case, let's let a uniform distribution start the process, but adjust the values to give a sum of 1. For the sake of illustration, I'll use an initial set of [-1, -0.75, 0, 0.25, 1]. This gives us a sum of -0.5, but we require 1.0.

STEP 1: Compute the amount of total change needed: 1.0 - (-0.5) = 1.5.

Now, we will apportion that change among the elements of the distribution in some appropriate fashion. One simple method I've used is to move middle elements the most, while keeping the endpoints stable.

STEP 2: Compute the difference of each element from the nearer endpoint. For your nice range, this is 1 - abs(x).

STEP 3: Sum these differences, then divide the required change by that total. Multiplying each element's difference by this ratio gives the amount to adjust that element.

Putting this much into a chart:

  x     diff   adjust
-1.00   0.00   0.0000
-0.75   0.25   0.1875
 0.00   1.00   0.7500
 0.25   0.75   0.5625
 1.00   0.00   0.0000

Now, simply add the x and adjust columns to get the new values:

  x     adjust    new
-1.00   0.0000  -1.0000
-0.75   0.1875  -0.5625
 0.00   0.7500   0.7500
 0.25   0.5625   0.8125
 1.00   0.0000   1.0000

There is your adjusted data set: a sum of 1.0, with the endpoints intact.


Simple python code:

x = [-1, -0.75, 0, 0.25, 1.0]
diff = [1 - abs(q) for q in x]      # distance from the nearer endpoint
total_diff = sum(diff)
needed = 1.0 - sum(x)               # total change required

adjust = [q * needed / total_diff for q in diff]
new = [x[i] + adjust[i] for i in range(len(x))]
for i in range(len(x)):
    print(f'{x[i]:8} {diff[i]:8} {adjust[i]:8} {new[i]:8}')
print(new, sum(new))

Output:

      -1        0      0.0     -1.0
   -0.75     0.25   0.1875  -0.5625
       0        1     0.75     0.75
    0.25     0.75   0.5625   0.8125
     1.0      0.0      0.0      1.0
[-1.0, -0.5625, 0.75, 0.8125, 1.0] 1.0

I'll let you vectorize this in NumPy.
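One way that vectorization might look, using the same toy data:

```python
import numpy as np

x = np.array([-1, -0.75, 0, 0.25, 1.0])
diff = 1 - np.abs(x)                  # distance to the nearer endpoint
needed = 1.0 - x.sum()                # total change required
new = x + diff * needed / diff.sum()  # apportion the change; endpoints stay fixed
print(new, new.sum())
```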


3 Comments

name 'needed' is not defined. should it be needed = total_diff - total? When I then send [-1, -0.75, 0, 0.25, 1] to your function with my definition of needed, the sum of the corrected array is 2.
I've found a way to do it with a transformed Dirichlet distribution, but need help stackoverflow.com/questions/63910689/…
I see that other people beat me to the "obvious" analysis.

You can create two different arrays for positive and negative values. Make sure the positive side adds up to 1 and negative side adds up to 0.

import numpy as np
size = 10
x_pos = np.random.uniform(0, 1, int(np.floor(size/2)))
x_pos = x_pos/x_pos.sum()        # scale so this half sums to exactly 1
x_neg = np.random.uniform(0, 1, int(np.ceil(size/2)))
x_neg = x_neg - x_neg.mean()     # center so this half sums to exactly 0

x = np.concatenate([x_pos, x_neg])
np.random.shuffle(x)

print(x.sum(), x.max(), x.min())
>>> 0.9999999999999998 0.4928358768227867 -0.3265210342316333

print(x)
>>>[ 0.49283588  0.33974127 -0.26079784  0.28127281  0.23749531 -0.32652103
  0.12651658  0.01497403 -0.03823131  0.13271431]

5 Comments

there are no negative numbers in x_neg though; its code is the same as for x_pos
while concatenating I am taking -x_neg
it doesn't work when size = 5 or less; the max and min bounds of 1 and -1 are breached
On second thought, scaling up to 2 will make sure that the positive side has large values that could exceed 1. So have two arrays, one with positve values that add up to 1, and another one centered around its mean so that it adds up to 0
I've found a way to do it with a transformed Dirichlet distribution, but need help stackoverflow.com/questions/63910689/…

Rejection sampling

You can use rejection sampling. The method below does this by sampling in a space of 1 dimension less than the original space.

  • Step 1: sample x(1), x(2), ..., x(n-1), drawing each x(i) from a uniform distribution on [-1, 1]
  • Step 2: if the sum S = x(1) + x(2) + ... + x(n-1) is below 0 or above 2, reject and start again at Step 1
  • Step 3: compute the n-th variable as x(n) = 1 - S (the bounds in Step 2 guarantee -1 <= x(n) <= 1)

Intuition

You can view the vector x(1), x(2), ..., x(n-1), x(n) as a point in the interior of an n-dimensional cube with corners at ±1, ±1, ..., ±1, so that the constraints -1 <= x(i) <= 1 are satisfied.

The additional constraint that the sum of the coordinates must equal 1 restricts the coordinates to a smaller space than the hypercube: a hyperplane section of dimension n-1.

If you do regular rejection sampling, drawing all n coordinates from a uniform distribution, you will never hit the constraint: a sampled point will never lie exactly in the hyperplane. Therefore you sample only a subspace of n-1 coordinates, and rejection sampling becomes usable.

Visually

Say you have dimension 4; then you can plot 3 of the 4 coordinates. This plot (homogeneously) fills a polyhedron. Below this is illustrated by plotting the polyhedron in slices. Each slice corresponds to a different sum S = x(1) + x(2) + ... + x(n-1) and hence a different value for x(n).

Image: domain for 3 coordinates. Each colored surface relates to a different value for the 4-th coordinate.

Marginal distributions

For large dimensions, rejection sampling will become less efficient because the fraction of rejections grows with the number of dimensions.

One way to 'solve' this would be to sample from the marginal distributions directly. However, it is a bit tedious to compute these marginal distributions. Comparison: for generating samples from a Dirichlet distribution a similar algorithm exists, but in that case the marginal distributions are relatively easy. (It is not impossible to derive the distributions here either; see 'Relationship with Irwin Hall distribution' below.)

In the example above the marginal distribution of the x(4) coordinate corresponds to the surface area of the cuts. So for 4 dimensions, you might be able to figure out the computation based on that figure (you'd need to compute the area of those irregular polygons) but it starts to get more complicated for larger dimensions.
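As a point of comparison, NumPy exposes the Dirichlet sampler directly; its samples are nonnegative and sum to 1 by construction, which is what makes its marginals easy:

```python
import numpy as np

# flat Dirichlet over 5 components: nonnegative entries summing to exactly 1
d = np.random.dirichlet(np.ones(5))
print(d, d.sum())
```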

Relationship with Irwin Hall distribution

To get the marginal distributions you can use truncated Irwin Hall distributions. The Irwin Hall distribution is the distribution of a sum of uniformly distributed variables and follows a piecewise polynomial shape. This is demonstrated below for one example.

Code

Since my Python is rusty I will mostly add R code. The algorithm is very basic, so I imagine any Python coder can easily adapt it. The hard part of the question seems to me to be the algorithm rather than how to code it in Python, so I leave the translation to others.

Image: output from sampling. The 4 black curves are marginal distributions for the four coordinates. The red curve is a computation based on an Irwin Hall distribution. This can be extended to a sampling method by computing directly instead of rejection sampling.

The rejection sampling in Python

import numpy as np

def sampler(size):
    reject = True
    while reject:
        x = np.random.uniform(-1, 1, size - 1)  # step 1: sample n-1 coordinates in [-1, 1]
        S = np.sum(x)
        reject = (S < 0) or (S > 2)             # step 2: reject unless 0 <= S <= 2
    return np.append(x, 1 - S)                  # step 3: n-th coordinate, guaranteed in [-1, 1]

y = sampler(5)
print(y, np.sum(y))

Some more code in R, including the comparison with the Irwin Hall distribution. This distribution can be used to compute the marginal distributions and to devise an algorithm that is more efficient than rejection sampling.

### function to do rejection sample
samp <- function(n) {
  S <- -1
  ## a while loop that performs step 1 (sample) and 2 (compare sum)
  while((S<0) || (S>2) ) { 
    x <- runif(n-1,-1,1)
    S <- sum(x)
  }
  x <- c(x,1-S) ## step 3 (generate n-th coordinate)
  x
}

### compute 10^5 samples
y <- replicate(10^5,samp(4))

### plot histograms
h1 <- hist(y[1,], breaks = seq(-1,1,0.05))
h2 <- hist(y[2,], breaks = seq(-1,1,0.05))
h3 <- hist(y[3,], breaks = seq(-1,1,0.05))
h4 <- hist(y[4,], breaks = seq(-1,1,0.05))

### histograms together in a line plot
plot(h1$mids,h1$density, type = 'l', ylim = c(0,1),
     xlab = "x[i]", ylab = "frequency", main = "marginal distributions")
lines(h2$mids,h2$density)
lines(h3$mids,h3$density)
lines(h4$mids,h4$density)

### add distribution based on Irwin Hall distribution

### Irwin Hall PDF
dih <- function(x,n=3) {
  k <- 0:(floor(x))   
  terms <- (-1)^k * choose(n,k) *(x-k)^(n-1)
  sum(terms)/prod(1:(n-1))
}
dih <- Vectorize(dih)

### Irwin Hall CDF
pih <- function(x,n=3) {
  k <- 0:(floor(x))   
  terms <- (-1)^k * choose(n,k) *(x-k)^n
  sum(terms)/prod(1:(n))
}
pih <- Vectorize(pih)


### adding the line 
### (note we need to scale the variable for the Irwin Hall distribution)
xn <- seq(-1,1,0.001)

range <- c(-1,1)
cum <- pih(1.5+(1-range)/2,3)
scale <- 0.5/(cum[1]-cum[2]) ### renormalize
                           ### (the factor 0.5 is due to the scale difference)
lines(xn,scale*dih(1.5+(1-xn)/2,3),col = 2)
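For readers who prefer Python, the dih density above translates roughly as follows (same piecewise-polynomial formula; math.comb needs Python 3.8+):

```python
import math

def dih(x, n=3):
    # Irwin Hall PDF: density of a sum of n independent Uniform(0,1) variables,
    # piecewise polynomial on [0, n], zero elsewhere
    if x < 0 or x > n:
        return 0.0
    terms = sum((-1)**k * math.comb(n, k) * (x - k)**(n - 1)
                for k in range(int(math.floor(x)) + 1))
    return terms / math.factorial(n - 1)
```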

5 Comments

sidenote: uniform is not well defined, but I assumed a constant probability density on the hyperplane in terms of probability mass per euclidean volume element dx(1)*dx(2)*...*dx(n). This is a bit tricky to visualize because you do not integrate over the volume of the hypercube but instead over the hyperplane. E.g. imagine the simple case of a homogeneous density on a straight 1D line embedded in a 2D space.
This is no longer stats stack, so how can rejection sampling be coded for the [-1, 1] and sum-1 request?
Maybe this question would be more on-topic on stats stack. It is a very basic algorithm but I am not so good in python. I will add my R-code. It should be straightforward to turn it into python.
ok, but the question was also asked on stats with no results. This here is its coding counterpart, indicated by the attempts at coding in the question itself
The question on stats stackexchange is different. That question did not specify that the distribution had to be uniform. That addition makes it much easier to provide an answer, because without that specification there are infinitely many possibilities. (Actually there are still infinitely many possibilities because of the ambiguity in the term 'uniform', but this is the simplest approach.)

You have coded an algebraic contradiction. The assumption of the question you cite is that the random sample will approximately fill the range [-1, 1]. If you re-scale linearly, it is algebraically impossible to maintain that range unless the sum is 1 before scaling, such that the scaling makes no changes.

You have two immediate choices here:

  1. Surrender the range idea. Make a simple change to ensure that the sum will be at least 1, and accept a smaller range after scaling. You can do this in any way you like that skews the choices toward the positive side.
  2. Change your original "random" selection algorithm such that it tends to maintain a sum near to 1, and then add a final element that returns it to exactly 1.0. Then you don't have to re-scale.

Consider basic interval algebra. If you begin with the interval (range) of [-1,1] and multiply by a (which would be 1/sum(x) for you), then the resulting interval is [-a,a]. If a > 1, as in your case, the resulting interval is larger. If a < 0, then the ends of the interval are swapped.
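A small numeric sketch of that interval algebra (hypothetical values):

```python
import numpy as np

x = np.array([-0.8, 0.9, 0.3, -0.2])  # hypothetical sample, sum = 0.2
a = 1 / x.sum()                        # rescaling factor a = 5.0
y = x * a                              # sums to 1, but values now span [-4.0, 4.5],
print(y, y.sum())                      # inside the enlarged interval [-a, a] = [-5, 5]
```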


From your comments, I infer that your conceptual problem is a bit more subtle. You are trying to force a distribution with an expected value of 0 to yield a sum of 1. This is unrealistic until you agree to somehow skew that distribution within certain bounds. So far, you have declined my suggestions, but have not offered anything you will accept. Until you identify that, I cannot reasonably suggest a solution for you.

7 Comments

the requirement for sum to 1 has to be exact, and adding on a final element is too rag-tag
the priority is on output (after re-scaling), rather than input (before re-scaling), meeting the two requirements. the requirements do not have to be fulfilled for the input, just the output, if that makes it easier
It's no easier; the linear transformation can be applied after the initial generation, or simply incorporated into the original process.
let me know if i should remove the transformation step and edit the question to, more directly, "how to generate a random array whose values are between -1 and 1 and sum to 1"
alright i will. i completely understand your suggestions and have seen it done
