Parallel transformations tree update

Question

I'm building my own pet-project graphics engine for learning and research purposes. I'm now trying to make the scene transformation update more efficient. My current approach is a linearly encoded transformation tree, as follows. My tree node:

    struct TreeNode
    {
        TransformIndex parent;
        TransformIndex firstChild;
        TransformIndex next;
        TransformIndex prev;
    };

It is stored in a vector:

    std::vector<TreeNode> tree_;

I also have world and local matrices for every node:

    std::vector<mat4x4> worldTransforms_;
    std::vector<mat4x4> localTransforms_;

A couple of dirty-flag sets and a set of flags marking unused indices:

    using Flags = boost::dynamic_bitset<uint64_t>;
    Flags worldDirty_;
    Flags localDirty_;
    Flags free_;

Using these flags, I insert each new node at any free index, but always after its parent:

TransformIndex TransformTree::allocateIndex(TransformIndex parent)
{
    assert(parent < tree_.size() && !free_.test(parent));
    auto posIndex = free_.find_next(parent);
    if (posIndex == Flags::npos)
    {
        posIndex = static_cast<TransformIndex>(free_.size());
        resize(free_.size() * 2);
    }
    auto pos = static_cast<TransformIndex>(posIndex);
///........
}

So the update of the world matrices at the beginning of every frame looks like this:

    for (auto index = worldDirty_.find_first(); index != Flags::npos; index = worldDirty_.find_next(index))
    {
        const auto parent = tree_[index].parent;
        assert(parent < index);
        worldTransforms_[index] = worldTransforms_[parent] * localTransforms_[index];
        worldDirty_.set(index, false);
    }

It's quite fast, but not as fast as it could be. The main issue is that it can't be parallelised: every matrix update depends on another matrix somewhere earlier in the array, and we have to know that one has already been updated. I want to change the data structure and/or the algorithm so that the update can run in parallel. It would also be nice to move it to a compute shader in the future. I've searched for a good while, but couldn't find any examples beyond passing mentions that such algorithms exist and are used in modern engines. Please give me any advice on how to make my implementation parallelisable.

Most scenes I've seen are forests of relatively shallow trees, rather than a single super deep tree (as long as you treat the scene root node separately). So I suspect there are ways you could partition that forest so that trees from one partition don't interleave with others, allowing you to process the partitions in parallel.
– DMGregory ♦, Commented Mar 4, 2021 at 13:13

Engineer · Accepted Answer · 2021-03-07 10:23:19Z

The Problem for GPU

Since each tree in the forest must have its pre-requisite ancestor nodes processed first, as you noted, the tree-in-forest becomes the basic unit of parallelisation.

Without a fixed / similar topological structure for every tree in the forest, conditionals are needed (in shader code) to determine each tree's structure on the fly. This leads to non-parallel processing per tree, and would thus work better on CPU.

One possible solution for GPU

Decide on some maximally-defined tree structure, and then for certain trees, use null nodes, thereby clipping off some topology uniquely per tree. You'd write code specifically to operate on that maximal tree structure (say 1 root, 3 nodes at depth 1, etc. - specifically), no conditionals. The same amount of geometry would be rendered, but some branches etc. would simply be degenerate (taking up zero volume) due to the null nodes specifying them.

This has an increased chance of cache misses, due to the expanded (non-sparse) data for each tree; but this may be acceptable. Your choice of maximal tree size is going to be critical to the performance of this solution. Too large and you'll have too many cache misses; too small and your trees won't look so great or be particularly varied.
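
To make the padding idea concrete, here is a minimal sketch, assuming every tree is padded out to a complete tree with a fixed branching factor and depth. All names (`kBranch`, `kDepth`, `PaddedTree`, `setLocal`, `updatePaddedTree`) are hypothetical, and a single float offset stands in for a 4x4 matrix (composition is addition, identity is 0). Because the shape is fixed and slots are stored level by level, the parent of slot `i` is always `(i - 1) / kBranch`, so no per-tree topology has to be read and the loop body is identical for every tree.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed shape: each tree is padded to a complete tree with
// kBranch children per node and kDepth levels below the root.
constexpr std::size_t kBranch = 3;
constexpr std::size_t kDepth = 2;

constexpr std::size_t slotCount()
{
    std::size_t total = 0, levelSize = 1;
    for (std::size_t d = 0; d <= kDepth; ++d) { total += levelSize; levelSize *= kBranch; }
    return total;
}
constexpr std::size_t kSlots = slotCount(); // 1 + 3 + 9 = 13 for the values above

// Stand-in for matrices: a 1D offset. Unused ("null") slots keep the
// identity value 0, so they collapse onto their parent.
struct PaddedTree
{
    std::array<float, kSlots> local{}; // value-initialised to 0 (identity)
    std::array<float, kSlots> world{};
};

inline void setLocal(PaddedTree& t, std::size_t slot, float value)
{
    assert(slot < kSlots);
    t.local[slot] = value;
}

// Slots are in level order, so the parent of slot i is (i - 1) / kBranch and
// always precedes it. Nothing in the loop depends on which slots are "real".
inline void updatePaddedTree(PaddedTree& t)
{
    t.world[0] = t.local[0];
    for (std::size_t i = 1; i < kSlots; ++i)
        t.world[i] = t.world[(i - 1) / kBranch] + t.local[i];
}
```

The cost discussed above is visible here: all `kSlots` entries are stored and updated for every tree, however few of them are actually used.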

Either way... GPU or CPU

> every single matrix update depends on another matrix somewhere back in the array and we have to know it's already updated

So you need a parent node to always appear before any of its children in that tree's linear list/array, so that it is guaranteed to have been processed within the running kernel instance by the time you reach its descendants. Try a depth-first, pre-order traversal of each distinct tree in the forest. To give you some idea of how to order nodes this way (a C++-style sketch; `Node` is a simplified stand-in with an explicit child list, not your `TreeNode`):

#include <cstddef>
#include <vector>

struct Node { std::vector<Node> children; /* payload, e.g. local matrix */ };
struct Range { std::size_t startIndex; std::size_t length; };

std::vector<const Node*> dataBuffer; // all trees, linearised back-to-back
std::vector<Range> rangesBuffer;     // one range per tree

// Appends node and its whole subtree in depth-first pre-order.
// Returns the number of nodes appended (the node itself plus all descendants).
std::size_t addToBuffer(const Node& node)
{
    //anything done here is pre-order...
    dataBuffer.push_back(&node);

    //recurse
    std::size_t count = 1;
    for (const Node& child : node.children) //or however you need to do this
    {
        count += addToBuffer(child);
    }

    //...anything done here is post-order (nothing in this example)

    return count;
}

void linearise(const std::vector<Node>& forest) //each element is the root of a tree
{
    std::size_t startIndex = 0;
    for (const Node& root : forest)
    {
        const std::size_t length = addToBuffer(root);
        rangesBuffer.push_back({startIndex, length});
        startIndex += length;
    }
}

Each tree's linearised data is pushed back-to-back into one buffer, and a second buffer holds Ranges, i.e. start-index/length pairs, one per tree, pointing into the first buffer. Each kernel instance is then spawned with its own range (start and length) and the dataBuffer itself.
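
As a rough CPU-side illustration of consuming those ranges, the sketch below runs one task per tree and stays sequential inside each tree. It assumes each node stores the absolute buffer index of its parent; `Mat4`, `mul`, `ForestBuffers`, `updateTree` and `updateForest` are hypothetical names, and spawning one `std::thread` per tree is only for brevity (a thread pool, or one GPU kernel instance per range, would be the more realistic choice).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using Mat4 = std::array<float, 16>; // row-major 4x4, stand-in for the engine's mat4x4

inline Mat4 mul(const Mat4& a, const Mat4& b)
{
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i * 4 + j] += a[i * 4 + k] * b[k * 4 + j];
    return r;
}

struct Range { std::size_t startIndex; std::size_t length; };

struct ForestBuffers
{
    std::vector<std::size_t> parent; // absolute index of each node's parent; unused for roots
    std::vector<Mat4> local;
    std::vector<Mat4> world;
    std::vector<Range> ranges;       // one per tree, each tree stored in pre-order
};

// Sequential within a tree: pre-order guarantees parent[i] < i inside the range.
inline void updateTree(ForestBuffers& f, const Range& r)
{
    assert(r.length > 0);
    f.world[r.startIndex] = f.local[r.startIndex]; // the root has no parent
    for (std::size_t i = r.startIndex + 1; i < r.startIndex + r.length; ++i)
    {
        assert(f.parent[i] >= r.startIndex && f.parent[i] < i);
        f.world[i] = mul(f.world[f.parent[i]], f.local[i]);
    }
}

// Parallel across trees: ranges never overlap, so each thread writes a
// disjoint slice of `world` and joining is the only synchronisation needed.
inline void updateForest(ForestBuffers& f)
{
    std::vector<std::thread> workers;
    workers.reserve(f.ranges.size());
    for (const Range& r : f.ranges)
        workers.emplace_back([&f, r] { updateTree(f, r); });
    for (std::thread& w : workers)
        w.join();
}
```

With many small trees, batching several ranges per task would likely amortise the scheduling overhead better than one task per tree.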

Optional: To find specific ID'ed nodes in this list after having inserted them, you'd need a third buffer or array, which would be a lookup table (LUT) of indices keyed by ID.
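
A minimal sketch of such a lookup table, assuming every node carries a unique integer ID; `NodeId`, `Node`, `LinearForest` and `indexOfId` are hypothetical names, and the node payload is omitted. The table is filled while appending, so it always matches the buffer order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using NodeId = std::uint32_t; // hypothetical stable ID, independent of buffer position

struct Node
{
    NodeId id;
    std::vector<Node> children;
};

struct LinearForest
{
    std::vector<NodeId> dataBuffer;                    // IDs in pre-order (payload omitted)
    std::unordered_map<NodeId, std::size_t> indexOfId; // the lookup table
};

// Pre-order append that also records where each ID lands in the buffer.
inline void addToBuffer(LinearForest& out, const Node& node)
{
    const bool inserted = out.indexOfId.emplace(node.id, out.dataBuffer.size()).second;
    assert(inserted && "node IDs must be unique");
    (void)inserted; // silences the unused-variable warning when asserts are disabled
    out.dataBuffer.push_back(node.id);
    for (const Node& child : node.children)
        addToBuffer(out, child);
}
```

If the IDs are small and dense, a plain array indexed by ID would serve the same purpose and is easier to upload to the GPU than a hash map.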

If you end up doing this on CPU & single-threaded only, then you can also get away with ordering purely by increasing depth across the entire forest, which is simpler. (see level order traversal). EDIT: For reasons of data locality, this is probably a worse idea than simply processing each individual tree from root down.
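
For comparison, a sketch of the depth-ordered variant, assuming the whole forest is stored level by level with an offsets array marking where each depth begins. `LevelOrderForest`, `levelStart` and `updateLevelOrder` are hypothetical names, and a float offset again stands in for a matrix. The loops are written serially; the comments mark where a parallel version would need to synchronise.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forest stored in level order: all roots first, then all depth-1 nodes, and so on.
struct LevelOrderForest
{
    std::vector<std::size_t> parent;     // index of the parent; ignored for roots
    std::vector<float> local;            // stand-in for a local matrix (1D offset)
    std::vector<float> world;
    std::vector<std::size_t> levelStart; // levelStart[d] = first index at depth d; last entry = node count
};

inline void updateLevelOrder(LevelOrderForest& f)
{
    assert(f.levelStart.size() >= 2 && f.levelStart.back() == f.local.size());

    // Depth 0: roots simply copy their local transform.
    for (std::size_t i = f.levelStart[0]; i < f.levelStart[1]; ++i)
        f.world[i] = f.local[i];

    // Deeper levels, one pass each. Iterations of the inner loop are independent
    // of one another: they only read results written by earlier levels.
    for (std::size_t d = 1; d + 1 < f.levelStart.size(); ++d)
    {
        for (std::size_t i = f.levelStart[d]; i < f.levelStart[d + 1]; ++i)
        {
            assert(f.parent[i] < f.levelStart[d]); // parent lives in an earlier level
            f.world[i] = f.world[f.parent[i]] + f.local[i];
        }
        // A parallel version needs a barrier (or a separate kernel dispatch) here
        // before starting the next level.
    }
}
```

Note that a node's parent can sit far away in the array in this layout, which is the data-locality cost mentioned above.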

Thank you for the answer. I thought about the structure you suggest, but I have some doubts about modifying such a tree. If I want to add a new child to a node, I have to move the whole array after that child's position, or at least keep some free space after every level of every subtree. Am I right?
– FoxCanFly, Commented Mar 5, 2021 at 13:36

As for your last suggestion, the level order traversal: I understand it as traversing the whole forest level by level. I don't understand why it can't be parallelised. Why can't we make the inner loop (iterating over a single level) parallel?
– FoxCanFly, Commented Mar 5, 2021 at 13:38

@FoxCanFly You can parallelise across a level. 1st though: why access a level at a time when doing a tree at a time is more efficient in terms of cache locality of data? 2nd: not every tree in the forest is going to have the same depth, correct? e.g. some will have no nodes at level 4 - how will you organise sparse data? Same problem as my solution has. 3rd: the only way to guarantee parents/ancestors have been processed before children/descendants is multiple kernel passes, as you have no guarantees on thread ordering. Do you want the overhead of running multiple kernel passes (per frame)?
– Engineer, Commented Mar 6, 2021 at 11:51

@FoxCanFly Re your doubts: Naturally this solution - and any other we come up with - will be imperfect, because you're wanting to do a task better suited to the CPU on the GPU... I'm just giving you options. The approach I've given in this solution ensures that every thread / warp gets its own identically-structured dataset to operate on, without conditional branches. I'm not sure, levelwise or not, how you intend to operate on a sparse tree without this, and without using conditionals, or without at least using a pass per level, which is inefficient. You cannot have it all. It's a trade-off.
– Engineer, Commented Mar 6, 2021 at 11:58
