Efficient algorithm to randomly select items with frequency

Question

Given an array of n word-frequency pairs:

[ (w₀, f₀), (w₁, f₁), ..., (w_n-1, f_n-1) ]

where w_i is a word, f_i is an integer frequency, and the sum of the frequencies ∑f_i = m,

I want to use a pseudo-random number generator (pRNG) to select p words w_j₀, w_j₁, ..., w_{j_p-1} such that the probability of selecting any word is proportional to its frequency:

P(w_i = w_{j_k}) = P(i = j_k) = f_i / m

(Note, this is selection with replacement, so the same word could be chosen every time).

I've come up with three algorithms so far:

1. Create an array of size m, and populate it so the first f₀ entries are w₀, the next f₁ entries are w₁, and so on, so the last f_n-1 entries are w_n-1.
```
[ w₀, ..., w₀, w₁, ..., w₁, ..., w_n-1, ..., w_n-1 ]
```
Then use the pRNG to select p indices in the range 0...m-1, and report the words stored at those indices.
This takes O(n + m + p) work, which isn't great, since m can be much, much larger than n.
2. Step through the input array once, computing
```
m_i = ∑_{h≤i} f_h = m_{i-1} + f_i
```
and after computing m_i, use the pRNG to generate a number x_k in the range 0...m_i - 1 for each k in 0...p-1 and select w_i for w_{j_k} (possibly replacing the current value of w_{j_k}) if x_k < f_i.
This requires O(n + np) work.
3. Compute m_i as in algorithm 2, and generate the following array of n word-frequency-partial-sum triples:
```
[ (w₀, f₀, m₀), (w₁, f₁, m₁), ..., (w_n-1, f_n-1, m_n-1) ]
```
and then, for each k in 0...p-1, use the pRNG to generate a number x_k in the range 0...m-1, then do a binary search on the array of triples to find the i s.t. m_i - f_i ≤ x_k < m_i, and select w_i for w_{j_k}.
This requires O(n + p log n) work.
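
For reference, the third approach (partial sums plus binary search) might look roughly like the following in Ruby. This is only a sketch: the method name `weighted_sample_bsearch` and the `rng:` keyword are made up for illustration, and it assumes the input is an array of `[word, frequency]` pairs with non-negative integer frequencies and at least one positive frequency.

```ruby
# Sketch of algorithm 3: partial sums plus binary search.
# `weighted_sample_bsearch` and `rng:` are illustrative names, not from the question.
def weighted_sample_bsearch(pairs, p, rng: Random.new)
  # cumulative[i] = f_0 + ... + f_i (the m_i above), built in O(n)
  total = 0
  cumulative = pairs.map { |_word, freq| total += freq }

  Array.new(p) do
    x = rng.rand(total) # uniform integer in 0...m
    # first index i with x < m_i, which is the i with m_i - f_i <= x < m_i
    i = cumulative.bsearch_index { |m_i| x < m_i }
    pairs[i][0]
  end
end

pairs = [["Hero", 80], ["Monkey", 4], ["Shoe", 13], ["Highway", 3]]
sample = weighted_sample_bsearch(pairs, 7, rng: Random.new(42))
```

Only the partial sums need to be stored for the search, so the triples collapse to a single array of m_i values here.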

My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?

this is OT, and please don't kill me for this, but how did you get sub/super scripts, and the sum equation signs?
– dassouki, Commented May 16, 2009 at 15:25
Just use <sub>...</sub> inside <code>...</code> blocks (for inline) or <pre>...</pre> blocks (for full line).
– rampion, Commented May 16, 2009 at 15:34
And for the sum sign, just use ∑ (see w3.org/TR/WD-entities-961125 for more html entities for math sigils)
– rampion, Commented May 16, 2009 at 15:36
BTW when performance is irrelevant here's copy and paste code to save you typing stackoverflow.com/a/33991225/294884
– Fattie, Commented Nov 30, 2015 at 4:01
note that algo 1 is of course spectacularly more efficient, assuming you do not count the time to assemble the array to begin with (i.e., if you do that only once at development time).
– Fattie, Commented Nov 30, 2015 at 4:03

Community · Accepted Answer · 2017-05-23 12:07:10Z

6

This sounds like roulette wheel selection, mainly used for the selection process in genetic/evolutionary algorithms.

Look at Roulette Selection in Genetic Algorithms

edited May 23, 2017 at 12:07

CommunityBot


answered May 16, 2009 at 15:06

seb


3 Comments

Noldorin Over a year ago

Yeah, this is exactly what algorithm is required. You're not going to get quicker than O(n) complexity for sure.

rampion Over a year ago

Ok. They're just using iterative search, which requires O(n log m) to select each, and a total work of O(n log m + pn log m), just like my algorithm 2. Thanks!

Karoly Horvath Over a year ago

with binary search it's O(n + p * log n). Why do you have m there? It doesn't affect the algorithm complexity.

Guffa · Accepted Answer · 2009-05-16 15:54:48Z

2

You could create the target array, then loop through the words, computing for each word the probability that it should be picked, and replace entries in the array according to a random number.

For the first word the probability would be f₀/m₀ (where m_n=f₀+..+f_n), i.e. 100%, so all positions in the target array would be filled with w₀.

For the following words the probability falls, and when you reach the last word the target array is filled with randomly picked words according to the frequency.

Example code in C#:

public class WordFrequency {

    public string Word { get; private set; }
    public int Frequency { get; private set; }

    public WordFrequency(string word, int frequency) {
        Word = word;
        Frequency = frequency;
    }

}

WordFrequency[] words = new WordFrequency[] {
    new WordFrequency("Hero", 80),
    new WordFrequency("Monkey", 4),
    new WordFrequency("Shoe", 13),
    new WordFrequency("Highway", 3),
};

int p = 7;
string[] result = new string[p];
int sum = 0;
Random rnd = new Random();
foreach (WordFrequency wf in words) {
    sum += wf.Frequency;
    for (int i = 0; i < p; i++) {
        if (rnd.Next(sum) < wf.Frequency) {
            result[i] = wf.Word;
        }
    }
}

answered May 16, 2009 at 15:54

Guffa


5 Comments

rampion Over a year ago

Right. This is exactly algorithm 2.

Guffa Over a year ago

Is that what you meant? I was thrown off by the O() calculation. The frequency values are irrelevant for how much work there is, so the m has no business in the O() value. It should simply be O(np).

rampion Over a year ago

No, the frequency values matter - it takes O(log m) bits to store a frequency, and O(log m) work to add two frequencies or compare two. Usually this is just swallowed by a constant term when log m < 64 (you store it in a 64 bit int), but for larger numbers, it can matter.

Guffa Over a year ago

If you want that kind of complexity, then you have to consider the data size for every operation... Looping through the pairs is not an O(n) operation, but an O(n log n) operation... Creating an array is not an O(p) operation, but an O(p log p) operation...

rampion Over a year ago

good point. I'll adjust my complexity descriptions accordingly.

8 revs, 2 users 94% · Accepted Answer · 2017-05-23 11:48:38Z

Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:

- there are n partitions, all of the same width r s.t. nr = m.
- each partition contains two words in some ratio (which is stored with the partition).
- for each word w_i, f_i = ∑_{partitions t s.t. w_i ∈ t} r × ratio(t, w_i)

Since all the partitions are of the same size, selecting which partition can be done in constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select which word is used in constant work (compare a pRNGed number with the ratio between the two words). So this means the p selections can be done in O(p) work, given such a partition.
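
To make that concrete, here is a small hand-built partition for the frequencies used in the C# example in another answer (Hero 80, Monkey 4, Shoe 13, Highway 3, so n = 4, m = 100, r = 25). The table layout `[word, other_word, share]` and the helper names are my own choices for this sketch; `share` is how much of the width r belongs to `word`, and the rest belongs to `other_word`.

```ruby
# Hand-built partition table, assuming n = 4, m = 100, r = 25.
# Row layout [word, other_word, share] is an illustrative choice.
r = 25
partitions = [
  ["Monkey",  "Hero",  4],
  ["Shoe",    "Hero", 13],
  ["Highway", "Hero",  3],
  ["Hero",    nil,    25], # a single-word partition
]

# Recover each word's frequency by summing its shares over all partitions.
def recovered_frequencies(partitions, r)
  totals = Hash.new(0)
  partitions.each do |word, other_word, share|
    totals[word] += share
    totals[other_word] += r - share if other_word
  end
  totals
end

# One draw: pick a partition uniformly, then pick within it. Constant work.
def draw(partitions, r, rng)
  word, other_word, share = partitions[rng.rand(partitions.size)]
  rng.rand(r) < share ? word : other_word
end

freqs = recovered_frequencies(partitions, r)
# => {"Monkey"=>4, "Hero"=>80, "Shoe"=>13, "Highway"=>3}
rng = Random.new(2009)
picks = Array.new(7) { draw(partitions, r, rng) }
```

Hero's 80 is spread over all four partitions as 21 + 12 + 22 + 25, which is the sum in the third bullet above.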

The reason that such a partitioning exists is that there exists a word w_i s.t. f_i < r, if and only if there exists a word w_i' s.t. f_i' > r, since r is the average of the frequencies.

Given such a pair w_i and w_i' we can replace them with a pseudo-word w'_i of frequency f'_i = r (that represents w_i with probability f_i/r and w_i' with probability 1 - f_i/r) and a new word w'_i' of adjusted frequency f'_i' = f_i' - (r - f_i) respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words which are the desired partition.

To construct this partition in O(n) time,

1. go through the list of the words once, constructing two lists:
   - one of words with frequency ≤ r
   - one of words with frequency > r
2. then, until the first list is empty, pull a word from it:
   - if its frequency = r, then make it into a one-element partition
   - otherwise, pull a word from the other list, and use it to fill out a two-word partition. Then put the second word back into either the first or second list according to its adjusted frequency.

This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can pad (multiply) all the frequencies by a factor of n, so f'_i = nf_i, which updates m' = mn and sets r' = m when q = n.
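
As a quick arithmetic check of the padding trick, take the hypothetical frequencies 1, 2, 4: n = 3 and m = 7, so r = 7/3 is not an integer. Multiplying by n fixes that without changing the proportions. The method name `scale_frequencies` is just for this sketch.

```ruby
# Pad every frequency by a factor of n so the partition width is integral.
# `scale_frequencies` is an illustrative name; input is an array of integers.
def scale_frequencies(freqs)
  n = freqs.size
  scaled = freqs.map { |f| f * n } # f'_i = n * f_i
  width = scaled.sum / n           # r' = m' / n = m, always an integer
  [scaled, width]
end

scaled, width = scale_frequencies([1, 2, 4])
# scaled is [3, 6, 12] (m' = 21) and width is 7, the original m
```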

In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.

In Ruby:

def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum,(word,freq)| sum + freq }

  # find the words with frequency lesser and greater than average
  lessers, greaters = input.map do |word,freq| 
                        # pad the frequency so we can keep it integral
                        # when subdivided
                        [ word, freq*n ] 
                      end.partition do |word,adj_freq| 
                        adj_freq <= m 
                      end

  partitions = Array.new(n) do
    word, adj_freq = lessers.shift

    other_word = if adj_freq < m
                   # use part of another word's frequency to pad
                   # out the partition
                   other_word, other_adj_freq = greaters.shift
                   other_adj_freq -= (m - adj_freq)
                   (other_adj_freq <= m ? lessers : greaters) << [ other_word, other_adj_freq ]
                   other_word
                 end

    [ word, other_word , adj_freq ]
  end

  (0...p).map do 
    # pick a partition at random
    word, other_word, adj_freq = partitions[ rand(n) ]
    # select the first word in the partition with appropriate
    # probability
    if rand(m) < adj_freq
      word
    else
      other_word
    end
  end
end
