Sampling from a discrete probability distribution algorithm

Question

I'm solving a task from Sedgewick book:

Write a class Sample with a constructor that takes an array p[] of double values as argument and supports the fol-lowing two operations: random() —return an index i with probability p[i]/T (where T is the sum of the numbers in p[])

I think that there is a simple solution: store all border values in array and find first value that is lower than random sample, for example, we have (value, weight) pairs: (1, 10.0), (2, 20.0), (3, 10.0), (4, 10.0). We convert it to (1, 0.0), (2, 10.0), (3, 30.0), (4, 40), sample random value [0-50] (for example, 35) and find that it is > 30, so answer is '3'.

But in the book there is a suggestion:

Use a complete binary tree where each node has implied weight p[i]. Store in each node the cumulative weight of all the nodes in its subtree. To generate a random index, pick a random number between 0 and T and use the cumulative weights to determine which branch of the subtree to explore.

and I saw this solution on github: https://github.com/reneargento/algorithms-sedgewick-wayne/blob/master/src/chapter2/section4/Exercise35_Sampling.java

But I don't understand why it works: instead of representing ranges we will have some tree that will have nodes like (3, 10), (4, 10), how searching for 'closest' node to random sample will help to find a correct node?

Gene · Accepted Answer · 2019-05-11 21:00:02Z

Your idea is on the right track, but not quite there. You want to do inverse transform sampling. The stepwise function you're thinking of is the inverse cumulative density function (cdf) of the given discrete distribution. It's more conventional to write it with the search value on the X axis over the interval [0..1). The weights are 1/5, 2/5, 1/5, 1/5 respectively for 1, 2, 3, and 4. You want to split the interval into pieces of this size and map these intervals to their respective values:

[0   .. 1/5) ->  1   // Note interval widths are 1/5,2/5,1/5,1/5 as desired.
[1/5 .. 3/5) ->  2
[3/5 .. 4/5) ->  3
[4/5 ..   1) ->  4

As you say, it's sufficient to store the tops of the intervals along with their values in an array. In C,

struct IntervalTop {
  double r;
  int value;
} cdf[] = {{.2, 1}, {.6, 2}, {.8, 3}, {1.0, 4}};

Now generate a random value in [0..1) and look up the respective sub-interval to determine the value. For example, 0.1 is in the first interval, so return 1. 0.7 is in the third interval, so return 3. A simple linear search works okay for starters:

double r = ... // Compute random double 0.0 <= r < 1.0 .
for (int i = 0; ; ++i)
  if (cdf[i].r > r) 
     return cdf[i].value;

But with this, search time grows with the number of intervals.

A simple way to improve performance is replace the loop with binary search. Then the search time grows as the log of the number of intervals.

But Sedgewick wants you to work harder, presumably for learning purposes.

His proposal also has run time O(log(n)), but it's more complicated. He's saying use a complete binary search tree. Each node contains the value, the weight (w), and also the sum of all weights (t) in the subtree rooted at the node. So for this problem, ...

                  ____3(w=1/5,t=1)____
                 /                     \
        2(w=2/5,t=3/5)           4 (w=1/5,t=1/5)
        /
1(w=1/5,t=1/5)

Actually, you don't need the weights for the algorithm (that's why S says they're "implicit"), but including them here them makes it easier to see what's happening.

You'll generate a random number r in [0..1) as above, but here you'll search the tree instead, using the value of r as a guide.

You'll do this by looking at tree.t, tree.left.t, and tree.right.t (a missing child is equivalent to the .t value being zero) and using these values to make the same decision that the binary search would have above.

I'll stop here so your fun isn't spoiled.

Collectives™ on Stack Overflow

Sampling from a discrete probability distribution algorithm

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related