One approach is AoSoA (read: Array of Structs of Arrays), a hybrid of AoS and SoA. The idea is to store N structs' worth of data in SoA form in one contiguous chunk, then the next N structs' worth in SoA form in the next chunk, and so on.
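As a concrete sketch in C++ (the type names here are made up for illustration), the three layouts for 16 float3 vectors, with N = 4, might be declared like this:

    // AoS: the three components of each vector sit next to each other.
    struct Vec3 { float x, y, z; };
    Vec3 aos[16];

    // SoA: one big array per component.
    struct Vec3SoA { float x[16], y[16], z[16]; };
    Vec3SoA soa;

    // AoSoA: SoA at a granularity of N = 4 structs per block, with the
    // blocks themselves laid out contiguously like an AoS. Aligned so
    // each 4-float component array can feed a 128-bit register directly.
    struct alignas(16) Vec3Block { float x[4], y[4], z[4]; };
    Vec3Block aosoa[4]; // 4 blocks * 4 structs each = 16 vectors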
Take 16 vectors as an example, labelled 0,1,2...F. In the diagrams below, the top line shows which struct each element belongs to, and the bottom line shows which component it is. Your AoS form is:
000111222333444555666777888999AAABBBCCCDDDEEEFFF
XYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZ
For SoA, this is:
0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF
XXXXXXXXXXXXXXXXYYYYYYYYYYYYYYYYZZZZZZZZZZZZZZZZ
For AoSoA, swizzled at a granularity of 4 structs, this becomes:
01230123012345674567456789AB89AB89ABCDEFCDEFCDEF
XXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZ
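Addressing an individual element in the AoSoA layout is just a block index plus a lane index. A minimal sketch, reusing the hypothetical Vec3Block from above:

    struct alignas(16) Vec3Block { float x[4], y[4], z[4]; };

    // Read the X component of struct i (N = 4): pick the block,
    // then the lane within it.
    float get_x(const Vec3Block* blocks, int i) {
        return blocks[i / 4].x[i % 4];
    }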
The AoSoA approach has the following benefits of AoS:
- Only a single DMA transfer is required to move a chunk of structs into SPU local memory.
- Structs still have a chance of all their data fitting in a cache line.
- Block prefetching is still very easy (see the sketch after this list).
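Because each chunk is a single contiguous run of memory, prefetching the next chunk while working on the current one takes one hint per block. A sketch using SSE's _mm_prefetch (the loop structure and names are illustrative, not a definitive implementation):

    #include <xmmintrin.h> // _mm_prefetch

    struct alignas(16) Vec3Block { float x[4], y[4], z[4]; };

    void process_all(Vec3Block* blocks, int num_blocks) {
        for (int b = 0; b < num_blocks; ++b) {
            // Hint the next contiguous block into cache while we
            // work on the current one.
            if (b + 1 < num_blocks)
                _mm_prefetch(reinterpret_cast<const char*>(&blocks[b + 1]),
                             _MM_HINT_T0);
            // ... operate on blocks[b] ...
        }
    }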
The AoSoA approach also has these benefits of SoA form:
- You can load data from SPU local memory directly into 128-bit vector registers without having to swizzle it (see the sketch after this list).
- You can still operate on 4 structs at once.
- You can fully utilize the SIMD'ness of your vector processor as long as there is no branching (i.e. no unused lanes in your vector arithmetic).
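For example, with SSE the three component arrays of a block load straight into 128-bit registers, and one instruction sequence handles 4 structs. A sketch using the same hypothetical Vec3Block (note the unaligned store at the end, since the caller's output array may not be 16-byte aligned):

    #include <xmmintrin.h>

    struct alignas(16) Vec3Block { float x[4], y[4], z[4]; };

    // Squared lengths of 4 vectors in one pass: three aligned loads,
    // no shuffles, and every SIMD lane does useful work.
    void length_sq_block(const Vec3Block* b, float out[4]) {
        __m128 x = _mm_load_ps(b->x);
        __m128 y = _mm_load_ps(b->y);
        __m128 z = _mm_load_ps(b->z);
        __m128 r = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x),
                                         _mm_mul_ps(y, y)),
                              _mm_mul_ps(z, z));
        _mm_storeu_ps(out, r);
    }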
The AoSoA approach still has some of the drawbacks of SoA form:
- Object management has to be done at the swizzling granularity.
- Random-access writes of a full struct now need to touch scattered memory (the sketch below illustrates this).
- (These can turn out to be non-issues depending on how you organize/manage your structs and their lifetimes.)
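To make the scattered-write drawback concrete: writing one full struct back is now three stores 16 bytes apart instead of one contiguous 12-byte write, as in this sketch (same hypothetical Vec3Block):

    struct alignas(16) Vec3Block { float x[4], y[4], z[4]; };

    // Write all of struct i: three stores into non-adjacent
    // locations of the block, rather than one contiguous write.
    void set_vec(Vec3Block* blocks, int i, float x, float y, float z) {
        Vec3Block& b = blocks[i / 4];
        int lane = i % 4;
        b.x[lane] = x;
        b.y[lane] = y;
        b.z[lane] = z;
    }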
BTW, these AoSoA concepts apply very well to SSE/AVX/LRBni, as well as to GPUs, which can be likened to very wide SIMD processors, e.g. 32/48/64-wide depending on the vendor/architecture.