The work $W(n)$ is the total number of nodes in your computation
graph and the span $D(n)$ is the number of nodes on the longest path
of that graph. $T_P(n)$ is the runtime of the algorithm
using $P$ processors; $T_1(n) = W(n)$ and $T_\infty(n) = D(n)$.
Suppose your code has the following shape:

    parfor i in range(n):
        parfor j in range(n):
            f()
where $f()$ is some constant-time function (with side effects). The keyword
"parfor" means that every loop iteration is independent. There are
$n^2$ independent iterations so in theory $W(n)\in O(n^2)$ and
$T_\infty(n) = D(n)\in O(1)$.
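To make the shape above concrete, here is a minimal runnable sketch in Python. It is not a real "parfor" (Python threads are scheduled by a pool, and the GIL limits actual parallelism); the names `f` and `nested_parfor` are just illustrative. It only demonstrates that the $n^2$ iterations are independent tasks that can be submitted all at once:

```python
from concurrent.futures import ThreadPoolExecutor

def f(i, j):
    # Stand-in for the O(1) loop body; returns a marker so we can
    # count how many iterations actually ran.
    return 1

def nested_parfor(n):
    # Submit all n*n independent iterations to a thread pool; the
    # pool plays the role of the "parfor" scheduler.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(f, i, j)
                   for i in range(n) for j in range(n)]
        return sum(fut.result() for fut in futures)

print(nested_parfor(4))  # 16 independent iterations in total
```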
However, the work of starting each $f()$ thread and waiting for its
result must be accounted for. That is, we must implement the "parfor"
keyword. Here is one way:
    parfor_impl(f, lo, hi):
        if hi - lo <= 1:
            f()
        else:
            mid = (lo + hi) // 2
            spawn parfor_impl(f, lo, mid)
            spawn parfor_impl(f, mid, hi)
            wait
where the keywords "spawn" and "wait" launch a thread and wait for all
previously launched threads to complete, respectively. This is a
classical recursive scheme with $W_{parfor} \in O(n)$ and $D_{parfor}
\in O(\log n)$. To get the total runtime (and span) of the two nested
loops we add, rather than multiply, the spans: the inner loops all run
in parallel with each other, so only the deepest one lies on the
critical path, after the outer loop's spawn tree. That gives
$O(\log n) + O(\log n) = O(\log n)$.
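The divide-and-conquer scheme can be sketched in Python with real threads standing in for "spawn" and "wait". This is only an illustration (one OS thread per spawn is far too heavyweight for a production scheduler, and `body`, `results`, and `lock` are names I made up here), but it shows the halving recursion and the join points:

```python
import threading

def parfor_impl(f, lo, hi):
    # Divide-and-conquer "parfor": split [lo, hi) in half, spawn a
    # thread per half, then wait for both. Recursion depth is
    # O(log(hi - lo)), matching the D_parfor bound in the text.
    if hi - lo == 1:
        f(lo)
    elif hi - lo > 1:
        mid = (lo + hi) // 2
        left = threading.Thread(target=parfor_impl, args=(f, lo, mid))
        right = threading.Thread(target=parfor_impl, args=(f, mid, hi))
        left.start()   # "spawn"
        right.start()  # "spawn"
        left.join()    # "wait"
        right.join()

results = []
lock = threading.Lock()

def body(i):
    # f() has side effects, so shared state must be guarded.
    with lock:
        results.append(i)

parfor_impl(body, 0, 8)
print(sorted(results))  # every index 0..7 ran exactly once
```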
I can highly recommend the "Intro to the Work-Span Model" chapter of
Udacity's High Performance Computing course, which delves deeply into
this topic.