For performance reasons, the program treats boneMatrix as an array of float4 vectors rather than an array of float3x4 matrices. The matrixIndex array contains floating-point values instead of integers, and so the addressing of a single array of vectors is more efficient than accessing an array of matrices.
There is definitely a performance implication for using an array of matrices. Personally, I can see two reasons for this: memory layout and index calculations.
Generally speaking, contiguous memory is much faster to access than non-contiguous memory, which is why it's common practice to flatten 2D arrays into 1D arrays. The drawback, as noted, is that flattening leaks the implementation: you have to handle it yourself by changing how you index.
But I could be wrong here: I suspect that a matrix in Cg is actually implemented as a 1D array already. That brings us to the second point. If a Cg matrix really is a 1D array, the only difference between an array of matrices and an array of vectors (a flattened matrix) is the index calculation. In the article the index is computed in floating point, and a single-precision floating-point multiply, add, or multiply-add takes 4 clock cycles per warp. The array of vectors needs only one index per lookup while the array of matrices needs two, so the flattened form takes fewer instructions per lookup. And remember, in shaders every instruction matters.
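To make the indexing difference concrete, here is a minimal sketch in Python rather than Cg. The bone data, the helper names, and the idea that `matrix_index` holds a precomputed floating-point row offset (bone number times 3) are all assumptions for illustration, not something taken from the article's code.

```python
# Hypothetical data: two bones, each a 3x4 matrix stored as three float4 rows.
bone_matrices = [
    [[1.0, 0.0, 0.0, 5.0], [0.0, 1.0, 0.0, 6.0], [0.0, 0.0, 1.0, 7.0]],
    [[0.0, -1.0, 0.0, 1.0], [1.0, 0.0, 0.0, 2.0], [0.0, 0.0, 1.0, 3.0]],
]
ROWS_PER_BONE = 3


def flatten(matrices):
    """Lay all rows out back to back: an array of float4 instead of float3x4."""
    return [row for matrix in matrices for row in matrix]


def row_from_matrices(matrices, bone, row):
    """Array-of-matrices lookup: two indices (which matrix, then which row)."""
    return matrices[bone][row]


def row_from_flat(flat, base, row):
    """Array-of-vectors lookup: one index, formed by a single add.

    `base` is assumed to be a float that already points at the bone's
    first row, so no multiply is needed at lookup time.
    """
    return flat[int(base) + row]


bone_rows = flatten(bone_matrices)

# Assumed per-vertex data: float offsets precomputed as bone * ROWS_PER_BONE.
matrix_index = [float(b * ROWS_PER_BONE) for b in range(len(bone_matrices))]

print(row_from_matrices(bone_matrices, 1, 2))
print(row_from_flat(bone_rows, matrix_index[1], 2))
```

Both lookups return the same row. The point of the flattened version is only that the address is one base-plus-offset computation instead of a two-level index.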
Update regarding your question in the comments:
But according to the code, they actually convert the vectors back to matrices in the for loop. Wouldn't it be more efficient to use an array of matrices and index it directly to get the model matrix? That would be only one indexing operation instead of three when constructing the model matrix.
What you said makes sense. My speculation is that the compiler notices that boneMatrix doesn't change, so it doesn't allocate a new matrix and instead just references the existing values. In other words, it isn't really constructing a new matrix, only aliasing the vectors so that matrix operations can be used on them. But how can we be sure? Someone needs to check the generated code.
Update: this has been confirmed by @EternalWind (check the comments). The compiler doesn't construct a new matrix but references the vectors directly; moreover, it was able to vectorize the operation using dot products.
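As a rough illustration of why no matrix needs to be built, here is a small Python sketch (not Cg, and not the actual compiler output). The data and function names are hypothetical. It shows that multiplying a 3x4 matrix assembled from three rows by a position gives exactly the same result as taking three dot products against those rows in place.

```python
# Hypothetical flattened bone data: two bones, three float4 rows each.
bone_rows = [
    [1.0, 0.0, 0.0, 5.0], [0.0, 1.0, 0.0, 6.0], [0.0, 0.0, 1.0, 7.0],
    [0.0, -1.0, 0.0, 1.0], [1.0, 0.0, 0.0, 2.0], [0.0, 0.0, 1.0, 3.0],
]


def dot(a, b):
    """Four-component dot product."""
    return sum(x * y for x, y in zip(a, b))


def transform_via_matrix(flat, base, position):
    """Copy three rows into a new 3x4 matrix, then do a matrix-vector multiply."""
    model = [list(flat[base + r]) for r in range(3)]  # explicit construction
    return [sum(model[r][c] * position[c] for c in range(4)) for r in range(3)]


def transform_via_dots(flat, base, position):
    """Skip the copy: one dot product per row, reading the rows where they are."""
    return [dot(flat[base + r], position) for r in range(3)]


position = [1.0, 2.0, 3.0, 1.0]  # homogeneous point, w = 1
base = 3                         # first row of the second bone

print(transform_via_matrix(bone_rows, base, position))
print(transform_via_dots(bone_rows, base, position))
```

Since the two forms are arithmetically identical, a compiler is free to emit the second one, which is consistent with what was reported in the comments.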