Given an array of n word-frequency pairs:
[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]
where wi is a word, fi is an integer frequencey, and the sum of the frequencies ∑fi = m,
I want to use a pseudo-random number generator (pRNG) to select p words wj0, wj1, ..., wjp-1 such that
the probability of selecting any word is proportional to its frequency:
P(wi = wjk) = P(i = jk) = fi / m
(Note, this is selection with replacement, so the same word could be chosen every time).
I've come up with three algorithms so far:
Create an array of size
m, and populate it so the firstf0entries arew0, the nextf1entries arew1, and so on, so the lastfp-1entries arewp-1.[ w0, ..., w0, w1,..., w1, ..., wp-1, ..., wp-1 ]
Then use the pRNG to selectpindices in the range0...m-1, and report the words stored at those indices.
This takesO(n + m + p)work, which isn't great, sincemcan be much much larger than n.Step through the input array once, computing
mi = ∑h≤ifh = mi-1 + fi
and after computingmi, use the pRNG to generate a numberxkin the range0...mi-1for eachkin0...p-1and selectwiforwjk(possibly replacing the current value ofwjk) ifxk < fi.
This requiresO(n + np)work.- Compute
mias in algorithm 2, and generate the following array on n word-frequency-partial-sum triples:[ (w0, f0, m0), (w1, f1, m1), ..., (wn-1, fn-1, mn-1) ]
and then, for each k in0...p-1, use the pRNG to generate a numberxkin the range0...m-1then do binary search on the array of triples to find theis.t.mi-fi ≤ xk < mi, and selectwiforwjk.
This requiresO(n + p log n)work.
My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?