Why is addressing an array of vectors more efficient than addressing an array of matrices in Cg?

Question

According to Nvidia's Cg tutorial (in the note right under section 6.5.2), addressing an array of vectors is more efficient than addressing an array of matrices. The reason it gives is that the index is a floating-point value rather than an integer.

Could anyone explain this a little bit?

Here is the quote:

For performance reasons, the program treats boneMatrix as an array of float4 vectors rather than an array of float3x4 matrices. The matrixIndex array contains floating-point values instead of integers, and so the addressing of a single array of vectors is more efficient than accessing an array of matrices. The implication of this is that the indices in the matrixIndex vector should be three times the actual matrix index. So, the program assumes 0 is the first matrix in the array, 3 is the second matrix, and so on. The indices are fixed for each vertex, so you improve performance by moving this "multiply by 3" outside the vertex program.

And here is the Cg program it's referring to:

// Example 6-5. The C6E5v_skin4m Vertex Program
void C6E5v_skin4m(float3   position    : POSITION,
                  float3   normal      : NORMAL,
                  float2   texCoord    : TEXCOORD0,
                  float4   weight      : TEXCOORD1,
                  float4   matrixIndex : TEXCOORD2,
              out float4   oPosition   : POSITION,
              out float2   oTexCoord   : TEXCOORD0,
              out float4   color       : COLOR,
          uniform Light    light,
          uniform float4   boneMatrix[72], // 24 matrices
          uniform float4x4 modelViewProj)
{
  float3 netPosition = 0, netNormal = 0;

  for (int i = 0; i < 4; i++) {
    float index = matrixIndex[i];
    float3x4 model = float3x4(boneMatrix[index + 0],
                              boneMatrix[index + 1],
                              boneMatrix[index + 2]);

    float3 bonePosition = mul(model, float4(position, 1));
    // Assume no scaling in matrix, just rotate & translate
    float3x3 rotate = float3x3(model[0].xyz,
                               model[1].xyz,
                               model[2].xyz);

    float3 boneNormal = mul(rotate, normal);
    netPosition += weight[i] * bonePosition;
    netNormal   += weight[i] * boneNormal;
  }

  netNormal = normalize(netNormal);
  oPosition = mul(modelViewProj, float4(netPosition, 1));
  oTexCoord = texCoord;
  color = computeLighting(light, netPosition, netNormal);
}

Location formula for an array element: arrayLocation + index * elementSize. Location formula for a matrix element: matrixLocation + (rowIndex * numberOfColumns + columnIndex) * elementSize. It's obvious why indexing a matrix would be slower.
– zoran404, Commented Jul 14, 2015 at 12:59
@zoran404 But if boneMatrix were actually an array of float3x4 matrices, wouldn't it be a single index into the array, like float3x4 model = boneMatrix[i]? I can't see why it would involve indexing into the individual elements of a matrix while constructing the model matrix.
– EternalWind, Commented Jul 14, 2015 at 14:47

concept3d · Accepted Answer · 2015-07-14 21:10:05Z

For performance reasons, the program treats boneMatrix as an array of float4 vectors rather than an array of float3x4 matrices. The matrixIndex array contains floating-point values instead of integers, and so the addressing of a single array of vectors is more efficient than accessing an array of matrices.

There is definitely a performance implication to using an array of matrices. Personally, I can see two possible reasons for it: memory layout and index calculation.

Generally speaking, a contiguous memory layout is much faster to access than a non-contiguous one, which is why it's common practice to flatten 2D arrays into 1D arrays. As the tutorial notes, the cost is that the implementation detail leaks out: the caller has to account for it by changing how the indices are computed.

I could be wrong, but I suspect that a matrix in Cg is actually implemented as a 1D array. That brings us to the second point: if so, the only difference between an array of matrices and an array of vectors (a flattened matrix) is the index calculation. In the article the index is computed in floating point, and a single-precision floating-point multiply, add, or multiply-add takes 4 clock cycles per warp. An array of vectors needs only one index, while an array of matrices needs two, so the vector form uses fewer instructions per lookup. And remember that in shaders every instruction matters.
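To make the index arithmetic concrete, here is a minimal sketch written in Python rather than Cg, so that it can be run on the CPU. The helper names flatten, premultiply and fetch_model are made up for illustration; only boneMatrix and matrixIndex correspond to names in the tutorial's program. It assumes each float3x4 is stored as three consecutive float4 rows, which is what the tutorial's "multiply by 3" implies, and it says nothing about how a real GPU turns the float index into an address.

```python
# Hypothetical CPU-side illustration of the flattened layout (not Cg).
ROWS_PER_MATRIX = 3  # assumption: a float3x4 is stored as three float4 rows


def flatten(matrices):
    """Lay out a list of 3x4 matrices as one flat list of float4 rows."""
    return [row for matrix in matrices for row in matrix]


def premultiply(bone_ids):
    """Done once per vertex, outside the shader: bone id -> float row offset."""
    return [float(b * ROWS_PER_MATRIX) for b in bone_ids]


def fetch_model(bone_matrix, index):
    """Mirror of the loop body: three lookups that differ only by an add."""
    base = int(index)  # the float index must become an integer offset somewhere
    return [bone_matrix[base + 0], bone_matrix[base + 1], bone_matrix[base + 2]]


# 24 dummy matrices; every entry of matrix m, row r holds the value 10*m + r.
matrices = [[[float(10 * m + r)] * 4 for r in range(3)] for m in range(24)]

bone_matrix = flatten(matrices)             # 72 rows, like boneMatrix[72]
matrix_index = premultiply([0, 1, 5, 23])   # stored per vertex as floats
model = fetch_model(bone_matrix, matrix_index[2])
```

With the indices premultiplied ahead of time, the per-vertex work in the loop is just "index + 0", "index + 1" and "index + 2"; the multiply by 3 never runs inside the vertex program.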

Update regarding your question in the comments:

But according to the code, they actually convert the vectors back to matrices in the for loop. Wouldn't it be more efficient to use an array of matrices and index it directly to get the model matrix? That would be only one indexing operation instead of three when constructing the model matrix.

What you said makes sense. My speculation is that the compiler notices that boneMatrix doesn't change, so it doesn't allocate a new matrix and just references the existing values. In other words, it isn't actually constructing a new matrix, just aliasing the vectors so that matrix operations can be used on them. But how can we be sure? Someone needs to check the generated code.

Update: this has been confirmed by @EternalWind (check the comments). The compiler doesn't construct a new matrix but references the vectors directly; moreover, it was able to vectorize the operation using dot products.
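The two shapes of generated code that EternalWind quotes in the comments compute the same matrix-vector product in different orders. As a rough sketch (Python, not Cg or GLSL; the function names are invented for illustration), the vector-array form corresponds to one dot product per row, and the matrix-array form corresponds to scaling each column and accumulating. This only shows that the two are mathematically equivalent; which one is cheaper depends on the compiler and the hardware.

```python
# Hypothetical illustration: two ways to evaluate mul(model, float4(position, 1))
# for a 3x4 matrix stored as three rows of four floats.

def mul_by_rows(rows, v):
    """One dot product per float4 row (shape of the vector-array output)."""
    return [sum(a * b for a, b in zip(row, v)) for row in rows]


def mul_by_columns(rows, v):
    """Scale each column by one component of v and accumulate
    (shape of the matrix-array output: a multiply followed by multiply-adds)."""
    result = [0.0, 0.0, 0.0]
    for c in range(4):
        column = [rows[r][c] for r in range(3)]
        result = [acc + x * v[c] for acc, x in zip(result, column)]
    return result


# Identity rotation plus a translation of (2, 3, 4).
model = [[1.0, 0.0, 0.0, 2.0],
         [0.0, 1.0, 0.0, 3.0],
         [0.0, 0.0, 1.0, 4.0]]
position = [1.0, 1.0, 1.0, 1.0]

by_rows = mul_by_rows(model, position)
by_columns = mul_by_columns(model, position)
```

Both calls return the translated point; the row form needs whole rows in hand, which is exactly what an array of float4 vectors provides.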

Everything in RAM is 1D. It's not a practice to flatten a 2D array to 1D, it's something that you have to do. And why are you mentioning the number of cycles the floating-point operations take (and even claiming they take 4 clock cycles)? Indexing is done with integers.
– zoran404, Commented Jul 14, 2015 at 13:13
zoran404 and Pip, you seem to have misread concept3d's answer. It does not claim that memory layout could ever be non-1D, only that the index could be, and that changing from an array with 2 indices to an array with 1 index is a common practice. You absolutely can do index math in floats, as long as you (or the compiler) clamp the result to an integer before doing the actual lookup.
– DMGregory ♦, Commented Jul 14, 2015 at 14:25
@concept3d Thanks for the detailed answer. But according to the code, they actually convert the vectors back to matrices in the for loop. Wouldn't it be more efficient to use an array of matrices and index it directly to get the model matrix? That would be only one indexing operation instead of three when constructing the model matrix.
– EternalWind, Commented Jul 14, 2015 at 14:40
@concept3d After checking the generated GLSL code, I found that if a matrix is used directly in Cg instead of being constructed from vectors, the generated code is indeed quite inefficient. The vector-based version generates something like t0.x = dot(boneMatrix[0], multiplyingVec4) for each computed component, while the matrix-based version generates t0.xyz = boneMatrix[0][1].xyz * multiplyingVec4.xxx; t0.xyz = boneMatrix[0][0].xyz * multiplyingVec4.yyy + t0.xyz; t0.xyz = boneMatrix[0][2].xyz * multiplyingVec4.zzz + t0.xyz; t0.xyz = boneMatrix[0][3].xyz * multiplyingVec4.www + t0.xyz;
– EternalWind, Commented Jul 14, 2015 at 17:25
