What is the best way to create a NumPy array x of a given size with values randomly (and uniformly?) spread between -1 and 1, that also sum to 1?

I tried 2*np.random.rand(size)-1 and np.random.uniform(-1,1,size) based on the discussion here. If I take a transformation approach and re-scale either result by its sum afterwards, x /= np.sum(x), the elements do sum to 1, but some elements suddenly become much greater than 1 or less than -1, which is not wanted.
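For example, a minimal reproduction of the problem (size 10 is arbitrary):

```python
import numpy as np

x = np.random.uniform(-1, 1, 10)
s = x.sum()              # typically not 1; often |s| < 1
x /= s                   # now the elements sum to 1...
print(x.min(), x.max())  # ...but whenever |s| < 1 the values get stretched outside [-1, 1]
```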

  • Any distribution symmetric around zero will have a sum that is either zero or very close to it. Dividing by zero is invalid, and dividing numbers by small fractions makes them bigger. Can you provide more context on what you're actually trying to achieve and why it's necessary? Commented Sep 14, 2020 at 21:11
  • @PaulH It will have a mean close to zero. I don't see why it would have to have a sum close to zero. Commented Sep 14, 2020 at 21:14
  • generate array x whose elements are each between -1 and 1, and whose elements sum to 1 Commented Sep 14, 2020 at 21:15
  • What you're asking for makes no mathematical sense. "randomly and uniformly spread between -1 and 1" completely determines the distribution; you can't attach another condition on top of that. Commented Sep 14, 2020 at 21:15
  • @PaulH I get that, but take np.random.uniform(-1, 1, 1000).sum() and you should not be surprised to see values around 50 or so. Commented Sep 14, 2020 at 21:20

4 Answers


In this case, let's let a uniform distribution start the process, but adjust the values to give a sum of 1. For the sake of illustration, I'll use an initial set of [-1, -0.75, 0, 0.25, 1]. This gives us a sum of -0.5, but we require 1.0.

STEP 1: Compute the amount of total change needed: 1.0 - (-0.5) = 1.5.

Now, we will apportion that change among the elements of the distribution in some appropriate fashion. One simple method I've used is to move middle elements the most, while keeping the endpoints stable.

STEP 2: Compute the difference of each element from the nearer endpoint. For your nice range, this is 1 - abs(x).

STEP 3: Sum these differences, then divide the required change by that total. Multiplying each element's difference by this ratio gives the amount to adjust that element.

Putting this much into a chart:

  x     diff   adjust
-1.00   0.00   0.0000
-0.75   0.25   0.1875
 0.00   1.00   0.7500
 0.25   0.75   0.5625
 1.00   0.00   0.0000

Now, simply add the x and adjust columns to get the new values:

  x     adjust    new
-1.00   0.0000  -1.0000
-0.75   0.1875  -0.5625
 0.00   0.7500   0.7500
 0.25   0.5625   0.8125
 1.00   0.0000   1.0000

There is your adjusted data set: a sum of 1.0, with the endpoints intact.


Simple python code:

x = [-1, -0.75, 0, 0.25, 1.0]
diff = [1 - abs(q) for q in x]      # distance from the nearer endpoint
total_diff = sum(diff)
needed = 1.0 - sum(x)               # total change required

adjust = [q * needed / total_diff for q in diff]
new = [x[i] + adjust[i] for i in range(len(x))]
for i in range(len(x)):
    print(f'{x[i]:8} {diff[i]:8} {adjust[i]:8} {new[i]:8}')
print(new, sum(new))

Output:

      -1        0      0.0     -1.0
   -0.75     0.25   0.1875  -0.5625
       0        1     0.75     0.75
    0.25     0.75   0.5625   0.8125
     1.0      0.0      0.0      1.0
[-1.0, -0.5625, 0.75, 0.8125, 1.0] 1.0

I'll let you vectorize this in NumPy.
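One way that vectorization might look, using the same toy data:

```python
import numpy as np

x = np.array([-1, -0.75, 0, 0.25, 1.0])
diff = 1 - np.abs(x)                  # distance to the nearer endpoint
needed = 1.0 - x.sum()                # total change required
new = x + diff * needed / diff.sum()  # apportion the change; endpoints stay fixed
print(new, new.sum())
```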


3 Comments

name 'needed' is not defined. should it be needed = total_diff - total? When I then send [-1, -0.75, 0, 0.25, 1] to your function with my definition of needed, the sum of the corrected array is 2.
I've found a way to do it with a transformed Dirichlet distribution, but need help stackoverflow.com/questions/63910689/…
I see that other people beat me to the "obvious" analysis.

You can create two different arrays for positive and negative values. Make sure the positive side adds up to 1 and negative side adds up to 0.

import numpy as np
size = 10
x_pos = np.random.uniform(0, 1, int(np.floor(size/2)))
x_pos = x_pos/x_pos.sum()        # scale so this half sums to exactly 1
x_neg = np.random.uniform(0, 1, int(np.ceil(size/2)))
x_neg = x_neg - x_neg.mean()     # center so this half sums to exactly 0

x = np.concatenate([x_pos, x_neg])
np.random.shuffle(x)

print(x.sum(), x.max(), x.min())
>>> 0.9999999999999998 0.4928358768227867 -0.3265210342316333

print(x)
>>>[ 0.49283588  0.33974127 -0.26079784  0.28127281  0.23749531 -0.32652103
  0.12651658  0.01497403 -0.03823131  0.13271431]

5 Comments

there are no negative numbers in x_neg though; its code is the same as for x_pos
while concatenating I am taking -x_neg
it doesn't work when size = 5 or less; the max and min bounds of 1 and -1 are breached
On second thought, scaling up to 2 will make sure that the positive side has large values that could exceed 1. So have two arrays, one with positve values that add up to 1, and another one centered around its mean so that it adds up to 0
I've found a way to do it with a transformed Dirichlet distribution, but need help stackoverflow.com/questions/63910689/…

Rejection sampling

You can use rejection sampling. The method below does this by sampling in a space of 1 dimension less than the original space.

  • Step 1: sample x(1), x(2), ..., x(n-1), drawing each x(i) from a uniform distribution on [-1, 1]
  • Step 2: if the sum S = x(1) + x(2) + ... + x(n-1) is below 0 or above 2, reject and start again at Step 1
  • Step 3: compute the n-th variable as x(n) = 1 - S (the bounds in Step 2 guarantee -1 <= x(n) <= 1)

Intuition

You can view the vector x(1), x(2), ..., x(n-1), x(n) as a point in the interior of an n-dimensional cube with corners at ±1, ±1, ..., ±1, so that the constraints -1 <= x(i) <= 1 are satisfied.

The additional constraint that the sum of the coordinates must equal 1 restricts the coordinates to a smaller space than the hypercube: a hyperplane section of dimension n-1.

If you do regular rejection sampling, drawing all n coordinates from a uniform distribution, you will never hit the constraint: a sampled point will never lie exactly in the hyperplane. Therefore you sample only a subspace of n-1 coordinates, and rejection sampling becomes usable.

Visually

Say you have dimension 4; then you can plot 3 of the 4 coordinates. This plot (homogeneously) fills a polyhedron. Below this is illustrated by plotting the polyhedron in slices. Each slice corresponds to a different sum S = x(1) + x(2) + ... + x(n-1) and hence a different value for x(n).

Image: domain for 3 coordinates. Each colored surface relates to a different value for the 4-th coordinate.

Marginal distributions

For large dimensions, rejection sampling will become less efficient because the fraction of rejections grows with the number of dimensions.

One way to 'solve' this would be to sample from the marginal distributions directly. However, it is a bit tedious to compute these marginal distributions. Comparison: for generating samples from a Dirichlet distribution a similar algorithm exists, but in that case the marginal distributions are relatively easy. (It is not impossible to derive the distributions here either; see 'Relationship with Irwin Hall distribution' below.)

In the example above the marginal distribution of the x(4) coordinate corresponds to the surface area of the cuts. So for 4 dimensions, you might be able to figure out the computation based on that figure (you'd need to compute the area of those irregular polygons) but it starts to get more complicated for larger dimensions.
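As a point of comparison, NumPy exposes the Dirichlet sampler directly; its samples are nonnegative and sum to 1 by construction, which is what makes its marginals easy:

```python
import numpy as np

# flat Dirichlet over 5 components: nonnegative entries summing to exactly 1
d = np.random.dirichlet(np.ones(5))
print(d, d.sum())
```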

Relationship with Irwin Hall distribution

To get the marginal distributions you can use truncated Irwin Hall distributions. The Irwin Hall distribution is the distribution of a sum of uniformly distributed variables and follows a piecewise polynomial shape. This is demonstrated below for one example.

Code

Since my Python is rusty I will mostly add R code. The algorithm is very basic, so I imagine any Python coder can easily adapt it. The hard part of the question seems to me to be the algorithm rather than how to code it in Python, so I leave the translation to others.

Image: output from sampling. The 4 black curves are marginal distributions for the four coordinates. The red curve is a computation based on an Irwin Hall distribution. This can be extended to a sampling method by computing directly instead of rejection sampling.

The rejection sampling in Python

import numpy as np

def sampler(size):
    reject = True
    while reject:
        x = np.random.uniform(-1, 1, size - 1)  # step 1: sample n-1 coordinates in [-1, 1]
        S = np.sum(x)
        reject = (S < 0) or (S > 2)             # step 2: reject unless 0 <= S <= 2
    return np.append(x, 1 - S)                  # step 3: n-th coordinate, guaranteed in [-1, 1]

y = sampler(5)
print(y, np.sum(y))

Some more code in R, including the comparison with the Irwin Hall distribution. This distribution can be used to compute the marginal distributions and to devise an algorithm that is more efficient than rejection sampling.

### function to do rejection sample
samp <- function(n) {
  S <- -1
  ## a while loop that performs step 1 (sample) and 2 (compare sum)
  while((S<0) || (S>2) ) { 
    x <- runif(n-1,-1,1)
    S <- sum(x)
  }
  x <- c(x,1-S) ## step 3 (generate n-th coordinate)
  x
}

### compute 10^5 samples
y <- replicate(10^5,samp(4))

### plot histograms
h1 <- hist(y[1,], breaks = seq(-1,1,0.05))
h2 <- hist(y[2,], breaks = seq(-1,1,0.05))
h3 <- hist(y[3,], breaks = seq(-1,1,0.05))
h4 <- hist(y[4,], breaks = seq(-1,1,0.05))

### histograms together in a line plot
plot(h1$mids,h1$density, type = 'l', ylim = c(0,1),
     xlab = "x[i]", ylab = "frequency", main = "marginal distributions")
lines(h2$mids,h2$density)
lines(h3$mids,h3$density)
lines(h4$mids,h4$density)

### add distribution based on Irwin Hall distribution

### Irwin Hall PDF
dih <- function(x,n=3) {
  k <- 0:(floor(x))   
  terms <- (-1)^k * choose(n,k) *(x-k)^(n-1)
  sum(terms)/prod(1:(n-1))
}
dih <- Vectorize(dih)

### Irwin Hall CDF
pih <- function(x,n=3) {
  k <- 0:(floor(x))   
  terms <- (-1)^k * choose(n,k) *(x-k)^n
  sum(terms)/prod(1:(n))
}
pih <- Vectorize(pih)


### adding the line 
### (note we need to scale the variable for the Irwin Hall distribution)
xn <- seq(-1,1,0.001)

range <- c(-1,1)
cum <- pih(1.5+(1-range)/2,3)
scale <- 0.5/(cum[1]-cum[2]) ### renormalize
                           ### (the factor 0.5 is due to the scale difference)
lines(xn,scale*dih(1.5+(1-xn)/2,3),col = 2)
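For readers who prefer Python, the dih density above translates roughly as follows (same piecewise-polynomial formula; math.comb needs Python 3.8+):

```python
import math

def dih(x, n=3):
    # Irwin Hall PDF: density of a sum of n independent Uniform(0,1) variables,
    # piecewise polynomial on [0, n], zero elsewhere
    if x < 0 or x > n:
        return 0.0
    terms = sum((-1)**k * math.comb(n, k) * (x - k)**(n - 1)
                for k in range(int(math.floor(x)) + 1))
    return terms / math.factorial(n - 1)
```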

5 Comments

sidenote: uniform is not well defined, but I assumed a constant probability density on the hyperplane in terms of probability mass per euclidean volume element dx(1)*dx(2)*...*dx(n). This is a bit tricky to visualize because you do not integrate over the volume of the hypercube but instead over the hyperplane. E.g. imagine the simple case of a homogeneous density on a straight 1D line embedded in a 2D space.
This is no longer stats stack, so how can rejection sampling be coded for the [-1, 1] and sum-1 request?
Maybe this question would be more on-topic on stats stack. It is a very basic algorithm but I am not so good in python. I will add my R-code. It should be straightforward to turn it into python.
ok, but the question was also asked on stats with no results. This here is its coding counterpart, indicated by the attempts at coding in the question itself
The question on stats stackexchange is different. That question did not specify that the distribution had to be uniform. That addition makes it much easier to provide an answer, because without that specification there are infinitely many possibilities. (Actually there are still infinitely many possibilities because of the ambiguity in the term 'uniform', but this is the simplest approach.)

You have coded an algebraic contradiction. The assumption of the question you cite is that the random sample will approximately fill the range [-1, 1]. If you re-scale linearly, it is algebraically impossible to maintain that range unless the sum is 1 before scaling, such that the scaling makes no changes.

You have two immediate choices here:

  1. Surrender the range idea. Make a simple change to ensure that the sum will be at least 1, and accept a smaller range after scaling. You can do this in any way you like that skews the choices toward the positive side.
  2. Change your original "random" selection algorithm such that it tends to maintain a sum near to 1, and then add a final element that returns it to exactly 1.0. Then you don't have to re-scale.

Consider basic interval algebra. If you begin with the interval (range) of [-1,1] and multiply by a (which would be 1/sum(x) for you), then the resulting interval is [-a,a]. If a > 1, as in your case, the resulting interval is larger. If a < 0, then the ends of the interval are swapped.
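A small numeric sketch of that interval algebra (hypothetical values):

```python
import numpy as np

x = np.array([-0.8, 0.9, 0.3, -0.2])  # hypothetical sample, sum = 0.2
a = 1 / x.sum()                        # rescaling factor a = 5.0
y = x * a                              # sums to 1, but values now span [-4.0, 4.5],
print(y, y.sum())                      # inside the enlarged interval [-a, a] = [-5, 5]
```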


From your comments, I infer that your conceptual problem is a bit more subtle. You are trying to force a distribution with an expected value of 0 to yield a sum of 1. This is unrealistic until you agree to somehow skew that distribution within certain bounds. So far, you have declined my suggestions, but have not offered anything you will accept. Until you identify that, I cannot reasonably suggest a solution for you.

7 Comments

the requirement for sum to 1 has to be exact, and adding on a final element is too rag-tag
the priority is on output (after re-scaling), rather than input (before re-scaling), meeting the two requirements. the requirements do not have to be fulfilled for the input, just the output, if that makes it easier
It's no easier; the linear transformation can be applied after the initial generation, or simply incorporated into the original process.
let me know if i should remove the transformation step and edit the question to, more directly, "how to generate a random array whose values are between -1 and 1 and sum to 1"
alright i will. i completely understand your suggestions and have seen it done
