I create my own pet-project graphics engine for the learning and research purposes. Now I'm trying to create a more efficient scene transformations update technique. My current approach is the linear encoded transormation tree as follows: My tree node
struct TreeNode
{
TransformIndex parent;
TransformIndex firstChild;
TransformIndex next;
TransformIndex prev;
};
It is contained in the vector
std::vector<TreeNode> tree_;
Also I have world and local matrices for every node
std::vector<mat4x4> worldTransforms_;
std::vector<mat4x4> localTransforms_;
Couple of the dirty flags and the unused indices flags:
using Flags = boost::dynamic_bitset<uint64_t>;
Flags worldDirty_;
Flags localDirty_;
Flags free_;
using this flags I insert the new nodes in the any free index, but always after their parent:
TransformIndex TransformTree::allocateIndex(TransformIndex parent)
{
assert(parent < tree_.size() && !free_.test(parent));
auto posIndex = free_.find_next(parent);
if (posIndex == Flags::npos)
{
posIndex = static_cast<TransformIndex>(free_.size());
resize(free_.size() * 2);
}
auto pos = static_cast<TransformIndex>(posIndex);
///........
}
so update of the world matrices at the begining of an every frame looks like this:
for (auto index = worldDirty_.find_first(); index != Flags::npos; index = worldDirty_.find_next(index))
{
const auto parent = tree_[index].parent;
assert(parent < index);
worldTransforms_[index] = worldTransforms_[parent] * localTransforms_[index];
worldDirty_.set(index, false);
}
It's quite fast, but not as good as it can be. Main issue here that it's unparallelable since every single matrix update depends on another matrix somewhere back in the array and we have to know it's already updated. I want to find a way to change the data structure and/or algorithm to be done in parallel. Also it's would be nice to do it in the compute shader in the future. I've googled for a decent time, but couldn't find any examples of it beyond the basic mentions that such algorithms exist and used in modern engines. Please, give me any advices, how to improve my implementation to make it parallelisable.