jpaver

One approach is AoSoA (Array of Structs of Arrays), a hybrid of AoS and SoA. The idea is to store N structs' worth of data in a contiguous chunk in SoA form, followed by the next N structs' worth in SoA form, and so on.

Take 16 vectors (labelled 0, 1, 2 ... F) and a swizzling granularity of 4 structs. Your AoS form is:

000111222333444555666777888999AAABBBCCCDDDEEEFFF
XYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZ

For SoA, this is three separate arrays:

0123456789ABCDEF
XXXXXXXXXXXXXXXX

0123456789ABCDEF
YYYYYYYYYYYYYYYY

0123456789ABCDEF
ZZZZZZZZZZZZZZZZ

For AoSoA, this becomes:

01230123012345674567456789AB89AB89ABCDEFCDEFCDEF
XXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZ
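
To make the layout above concrete, here is a minimal sketch of converting AoS data to AoSoA at a granularity of 4. All names (`Vec3`, `Vec3x4`, `kLanes`, `to_aosoa`) are hypothetical and chosen for illustration, and the sketch assumes the element count is a multiple of 4; a real version would have to pad the last chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names; assumed for illustration only.
constexpr std::size_t kLanes = 4;  // swizzle granularity (4 floats = 128 bits)

struct Vec3 { float x, y, z; };    // one AoS element: XYZ

// One AoSoA chunk: kLanes structs stored in SoA form (XXXXYYYYZZZZ).
struct Vec3x4 {
    float x[kLanes];
    float y[kLanes];
    float z[kLanes];
};
static_assert(sizeof(Vec3x4) == 3 * kLanes * sizeof(float),
              "a chunk is 12 contiguous floats");

// Convert an AoS array to AoSoA. Assumes size is a multiple of kLanes.
std::vector<Vec3x4> to_aosoa(const std::vector<Vec3>& aos) {
    assert(aos.size() % kLanes == 0);
    std::vector<Vec3x4> out(aos.size() / kLanes);
    for (std::size_t i = 0; i < aos.size(); ++i) {
        Vec3x4& chunk = out[i / kLanes];      // which group of 4 structs
        const std::size_t lane = i % kLanes;  // position inside the group
        chunk.x[lane] = aos[i].x;
        chunk.y[lane] = aos[i].y;
        chunk.z[lane] = aos[i].z;
    }
    return out;
}
```

With 16 input vectors this produces 4 chunks, and vector 6 ends up in chunk 1, lane 2, matching the `4567` group in the diagram.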

The AoSoA approach has the following benefits of AoS:

  • Only a single DMA transfer is required to transfer a chunk of structs to SPU local memory.
  • All of a struct's data still has a chance of fitting in a single cache line.
  • Block prefetching is still very easy.
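
As a rough illustration of the single-transfer point, the sketch below copies a run of chunks into a local buffer with one contiguous copy. `std::memcpy` is only a stand-in for a DMA transfer here (it is not the SPU API), and `Vec3x4`, `kLanes` and `fetch_chunks` are hypothetical names. With plain SoA the same fetch would need three copies, one per component array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical names; assumed for illustration only.
constexpr std::size_t kLanes = 4;
struct Vec3x4 { float x[kLanes], y[kLanes], z[kLanes]; };

// Copy `count` chunks starting at chunk index `first` into `local`.
// One contiguous copy moves every component of count * kLanes structs.
// std::memcpy stands in for a DMA transfer to local memory.
void fetch_chunks(const std::vector<Vec3x4>& main_mem,
                  std::size_t first, std::size_t count, Vec3x4* local) {
    assert(first + count <= main_mem.size());
    std::memcpy(local, main_mem.data() + first, count * sizeof(Vec3x4));
}
```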

The AoSoA approach also has these benefits of SoA form:

  • You can load data from SPU local memory directly into 128-bit vector registers without having to swizzle your data.
  • You can still operate on 4 structs at once.
  • You can fully utilize the SIMD width of your vector processor as long as the code does not branch per element (i.e. no unused lanes in your vector arithmetic).
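
The "4 structs at once" point can be sketched as follows. This uses a plain per-lane loop rather than real SPU or SSE intrinsics, on the assumption that each `float[4]` row maps onto one 128-bit register; `Vec3x4`, `Floatx4` and `length_squared` are hypothetical names.

```cpp
#include <cstddef>

// Hypothetical names; assumed for illustration only.
constexpr std::size_t kLanes = 4;
struct Vec3x4 { float x[kLanes], y[kLanes], z[kLanes]; };
struct Floatx4 { float v[kLanes]; };

// Squared length of 4 vectors at once. Each row (x, y, z) is already
// contiguous, so no swizzling is needed before the arithmetic, and every
// lane does useful work because there is no per-element branch.
inline Floatx4 length_squared(const Vec3x4& c) {
    Floatx4 r;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        r.v[lane] = c.x[lane] * c.x[lane]
                  + c.y[lane] * c.y[lane]
                  + c.z[lane] * c.z[lane];
    }
    return r;
}
```

On real hardware the loop body would become three vector multiplies and two adds (or multiply-adds) over whole registers; whether a compiler vectorizes this loop on its own depends on the toolchain.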

The AoSoA approach still has some of the drawbacks of SoA form:

  • Object management has to be done at swizzling granularity.
  • A random-access write of a full struct now has to touch scattered memory.
  • (These can turn out to be non-issues depending on how you organize and manage your structs and their lifetimes.)
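
The scattered-access drawback looks like this in practice. In this sketch (hypothetical names again: `Vec3`, `Vec3x4`, `load_one`, `store_one`), reading or writing one full struct touches three floats that sit `kLanes` elements apart instead of three adjacent floats.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names; assumed for illustration only.
constexpr std::size_t kLanes = 4;
struct Vec3 { float x, y, z; };
struct Vec3x4 { float x[kLanes], y[kLanes], z[kLanes]; };

// Read struct i: its components live at a stride of kLanes floats.
Vec3 load_one(const std::vector<Vec3x4>& chunks, std::size_t i) {
    assert(i / kLanes < chunks.size());
    const Vec3x4& c = chunks[i / kLanes];
    const std::size_t lane = i % kLanes;
    return Vec3{c.x[lane], c.y[lane], c.z[lane]};
}

// Write struct i: three separate stores, one per component row.
void store_one(std::vector<Vec3x4>& chunks, std::size_t i, const Vec3& v) {
    assert(i / kLanes < chunks.size());
    Vec3x4& c = chunks[i / kLanes];
    const std::size_t lane = i % kLanes;
    c.x[lane] = v.x;
    c.y[lane] = v.y;
    c.z[lane] = v.z;
}
```

The three stores still land inside one 48-byte chunk, which is why this cost is usually smaller than with plain SoA, where the components live in three unrelated arrays.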

BTW, these AoSoA concepts apply very well to SSE/AVX/LRBni, as well as to GPUs, which can be likened to very wide SIMD processors, e.g. 32/48/64 wide depending on the vendor/architecture.
