
I want to store a 3d-volume in memory. I use a linear array for this purpose and then calculate the 1d-index from the 3d-index. It is wrapped in a class called Volume that provides functions for accessing the data elements of the array. Here is the function for accessing one data element of the volume:

template<typename T>
inline T& Volume<T>::at(size_t x, size_t y, size_t z) {
    if (x >= this->xMax || y >= this->yMax || z >= this->zMax) throw std::out_of_range("Volume index out of bounds");
    return this->volume[x * this->yMax*this->zMax + y*this->zMax + z];
}

Now, this linearises the 3D volume with a Z-fastest index order. If the volume is accessed in a loop like this, the iteration runs sequentially over the volume elements as they lie in memory:

Volume<float> volume(10, 20, 30); //parameters define size
for(int x = 0; x < volume.xSize(); ++x) {
    for(int y = 0; y < volume.ySize(); ++y) {
        for(int z = 0; z < volume.zSize(); ++z) {
            volume.at(x, y, z);  //do sth with this voxel
        }
    }
}

However, if I wrote the loops like this, they would not be accessed sequentially but in a more "random" order:

Volume<float> volume(10, 20, 30); //parameters define size
for(int z = 0; z < volume.zSize(); ++z) {
    for(int y = 0; y < volume.ySize(); ++y) {
        for(int x = 0; x < volume.xSize(); ++x) {
            volume.at(x, y, z);  //do sth with this voxel
        }
    }
}

Now, the first case runs fast, the second case slow. My first question is: why? I guess it has got something to do with caching, but I'm not sure.

Now, I could rewrite the access function for the volume elements like this:

template<typename T>
inline T& Volume<T>::at(size_t x, size_t y, size_t z) {
    if (x >= this->xMax || y >= this->yMax || z >= this->zMax) throw std::out_of_range("Volume index out of bounds");
    return this->volume[z * this->yMax*this->xMax + y*this->xMax + x];
}

Then loop order #2 would be fast (because access happens sequentially), but loop order #1 slow.

Now, for some reason I need both index orders in my program, and both should be fast. The idea is that the index ordering can be chosen when the volume is created and is then used for all accesses. First I tried a simple if-else statement in the at function (sketched below), but that did not seem to do the trick.
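Such an if-else version would look roughly like this (a sketch combining the two layouts shown above, with a member mode of type IndexOrder assumed; not necessarily the exact code):

// Sketch of the branching at() described above; `mode` is an assumed member.
template<typename T>
inline T& Volume<T>::at(size_t x, size_t y, size_t z) {
    if (x >= this->xMax || y >= this->yMax || z >= this->zMax)
        throw std::out_of_range("Volume index out of bounds");
    if (this->mode == IndexOrder::X_FASTEST)
        return this->volume[z * this->yMax*this->xMax + y*this->xMax + x];
    // otherwise: the Z-fastest layout from the top of the question
    return this->volume[x * this->yMax*this->zMax + y*this->zMax + z];
}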

So I tried something like this when the ordering mode is set:

template<typename T>
void Volume<T>::setMemoryLayout(IndexOrder indexOrder) {
    this->mode = indexOrder;
    if (indexOrder == IndexOrder::X_FASTEST) {
        this->accessVoxel = [this](size_t x, size_t y, size_t z)->T& {
            return this->volume[z * this->yMax*this->xMax + y*this->xMax + x];
        };
    } else {
        this->accessVoxel = [this](size_t x, size_t y, size_t z)->T& {
            return this->volume[x * this->yMax* this->zMax + y*this->zMax + z];
        };
    }
}
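(The member that stores the lambda is not shown above; since the lambdas capture this, they cannot go into a plain function pointer, so presumably it is something like a std::function member:)

// Assumed member declarations behind setMemoryLayout (not shown in the post).
// std::function (from <functional>) type-erases whichever lambda was chosen.
template<typename T>
class Volume {
    // ... public interface as above ...
private:
    size_t xMax, yMax, zMax;   // dimensions
    T* volume;                 // or std::vector<T>: the backing storage
    IndexOrder mode;
    std::function<T&(size_t, size_t, size_t)> accessVoxel;
};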

And then when a voxel is actually accessed:

template<typename T>
inline T& Volume<T>::at(size_t x, size_t y, size_t z) {
    if (x >= this->xMax || y >= this->yMax || z >= this->zMax) throw std::out_of_range("Volume index out of bounds");
    return this->accessVoxel(x, y, z);
}

So my idea was to avoid the overhead of the if statement that would otherwise be needed inside the at function by defining the lambda once, when the mode is changed; it then only has to be called whenever at is called. However, this did not achieve the performance I was hoping for.

My question is why my attempts didn't work, and whether there is a way to actually do what I want: a volume that supports X-fastest as well as Z-fastest index ordering and offers the corresponding performance gain when looped over accordingly.

NOTE: My goal is not to switch between the two modes while data is already assigned to the volume and have that data still read correctly; the index ordering only needs to be chosen once, when the volume is created.

  • You called it on caching. As soon as you start hopping around you don't get nice long runs of data that can be read in batches. The CPU is going to read not only the memory you want, but some number of its neighbours, on the assumption that if you want N, then pretty soon you will want N+1. Might as well load N+1 now and save time later. If you want N and then N+100, you're going to read the memory around N and then the memory around N+100, and that gets painful. Commented Mar 17, 2016 at 23:06
  • Is your 3D volume really that small, or is it larger? Commented Mar 17, 2016 at 23:32
  • 1
    "this did not achieve what I desired" is not a question. You are talking about fast, but what is fast enough for you? And there are some bugs in your code I guess, e.g. for int x = 0; x < volume.zSize(); ++x. And yes you can specify storage mode like row_major or column_major at compile time e.g. through specialization. Commented Mar 17, 2016 at 23:34
  • 1
    An if that tests a compile-time constant will optimize away to literally nothing, like it was never there. Evaluating a lamba-expression might not be optimized away when the compiler can't prove what accessVoxel is set to. Like @knivil said, making row major vs column major storage order a template parameter should do the trick nicely. You always want the functions of this class to inline, so there's no bloat from needing to generate more stand-alone versions of functions. Commented Mar 18, 2016 at 1:58
  • stackoverflow.com/questions/12264970/… has an answer with some links to other questions, so if you want to start wading through lots of text about sequential vs. strided access, and potential gotchas when the stride is a multiple of 2048B (cache aliasing) / 4096B (false dependency between accesses on Intel), then start looking there. Commented Mar 18, 2016 at 2:05
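To illustrate the template-parameter idea from the comments above: this is only a sketch; the question defines IndexOrder::X_FASTEST, but the Z_FASTEST name, the class layout, and everything else here are assumptions, not the original code.

#include <cstddef>
#include <stdexcept>
#include <vector>

enum class IndexOrder { X_FASTEST, Z_FASTEST };   // Z_FASTEST name is assumed

template<typename T, IndexOrder Order>
class VolumeT {
public:
    VolumeT(std::size_t xSize, std::size_t ySize, std::size_t zSize)
        : xMax(xSize), yMax(ySize), zMax(zSize), volume(xSize * ySize * zSize) {}

    T& at(std::size_t x, std::size_t y, std::size_t z) {
        if (x >= xMax || y >= yMax || z >= zMax)
            throw std::out_of_range("Volume index out of bounds");
        return volume[index(x, y, z)];
    }

private:
    std::size_t index(std::size_t x, std::size_t y, std::size_t z) const {
        // Order is a compile-time constant, so the untaken branch disappears;
        // with C++17 this could be written as `if constexpr`.
        if (Order == IndexOrder::X_FASTEST)
            return z * yMax * xMax + y * xMax + x;
        else
            return x * yMax * zMax + y * zMax + z;
    }

    std::size_t xMax, yMax, zMax;
    std::vector<T> volume;
};

// Usage: the ordering is fixed when the volume is created, as the question asks.
// VolumeT<float, IndexOrder::X_FASTEST> volume(10, 20, 30);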

2 Answers


On my CPU (and probably yours) cache lines are 64 bytes, so each cache line holds 16 four-byte floats. When a cache line is fetched for the first float, you don't need to repeat that work for the following 15 when accessing sequentially.
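To make that concrete with the sizes from the question: with the Z-fastest layout of a 10×20×30 volume, stepping x in the inner loop jumps yMax*zMax = 20*30 = 600 floats (2400 bytes) between consecutive accesses, so essentially every access touches a new cache line, while stepping z moves only 4 bytes at a time and gets 16 voxels out of each 64-byte line.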

Note that it takes about 240 cycles to fetch a cache line from main memory. Fetching from L1 cache is something like 12 cycles, which is a big difference if you can hit L1 repeatedly (L2 costs about 40 cycles, L3 about 150).

The second caching win is that the CPU will prefetch data into cache for you when it detects sequential reads. So if you start at the beginning of an array and move through it sequentially, you can even avoid the penalty of waiting for each cache line to be read in.

L1 is usually 32 KB of data (plus 32 KB of instruction cache); for me, on this machine, L2 is 256 KB and L3 is several megabytes. So the smaller you can keep your memory working set, the more of it you can fit in a given cache. Fitting it all in L1 is optimal.

The third reason sequential access is optimal is that it gives your compiler the opportunity to vectorise the loop, i.e. use SSE or AVX instructions. AVX registers are 32 bytes, so one can hold 8 floats. Potentially you can operate on 8 consecutive items in the array at once, speeding things up by up to 8 times.
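A quick way to see all of this on your own machine is something like the sketch below (sizes, names and the timing approach are arbitrary, not from the question; compile with optimizations, and note the reduction itself only vectorises with relaxed FP settings such as -ffast-math):

// Sum the same Z-fastest buffer twice: once with the contiguous loop order,
// once with the strided one. The timing gap illustrates the cache behaviour
// discussed above. This is a rough sketch, not a rigorous benchmark.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t xMax = 256, yMax = 256, zMax = 256;   // 64 MB of floats
    std::vector<float> volume(xMax * yMax * zMax, 1.0f);

    auto run = [&](const char* label, auto&& body) {
        auto t0 = std::chrono::steady_clock::now();
        float sum = body();
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: sum=%.0f, %lld ms\n", label, sum, (long long)ms);
    };

    // z innermost: consecutive accesses are adjacent in memory.
    run("sequential", [&] {
        float sum = 0.0f;
        for (std::size_t x = 0; x < xMax; ++x)
            for (std::size_t y = 0; y < yMax; ++y)
                for (std::size_t z = 0; z < zMax; ++z)
                    sum += volume[x * yMax * zMax + y * zMax + z];
        return sum;
    });

    // x innermost: each access jumps yMax*zMax floats, so a new cache line every time.
    run("strided", [&] {
        float sum = 0.0f;
        for (std::size_t z = 0; z < zMax; ++z)
            for (std::size_t y = 0; y < yMax; ++y)
                for (std::size_t x = 0; x < xMax; ++x)
                    sum += volume[x * yMax * zMax + y * zMax + z];
        return sum;
    });
}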


Comments

L1 load-use latency in typical x86 CPUs is 4 or 5 cycles. Intel CPUs with per-core L2 caches have an L2 latency of something like 12c. See agner.org/optimize. The large shared L3 is much slower, esp. on a big Xeon where it can be farther away around the ring bus, or if other cores are keeping the L3 busy as well. Still upvoted for mostly-correct ideas; the specific numbers aren't that important. Do note that modern hardware prefetchers can recognize strided access patterns, so you don't have to pay the full latency penalty for every access. Many can be in flight.
The point about SIMD only working when the inner loop is accessing contiguous memory is an important one, though, beside making full use of each cache-line fetch.
@PeterCordes The last CPU I measured went 12, 40, 150, 240 in cycle count from L1 through to main memory (some Core i9 Xeon). Ulrich Drepper measured a Pentium M and got 3, 14, 240 (no L3?). Pointing to the top level of Agner's site didn't help me find his numbers; I wonder if you can link them better than that? I don't know which CPU has the numbers you're quoting. So yes, absolutely, the numbers are CPU dependent. Drepper, in case someone reading this hasn't seen it and cares - it's seminal: akkadia.org/drepper/cpumemory.pdf btw "mostly-correct ideas[sic]" will be perceived as insulting.
Agner's microarch pdf has the cache latency numbers. I tend to just link the top level because I can type that url accurately, and I don't want people to miss the other stuff if I just link one. Sorry for imprecise linking. Table 10.11 on pg143 for Haswell. His measurements are: L1: 4c / L2: 12c / L3: 34c. That's on a desktop, not a Xeon, and is probably a best-case figure. L1/L2 will be the same for Xeon vs. laptop because they're per-core. re: P-M: Intel didn't switch to the big shared-L3 until Nehalem, so yes, Pentium M only has L1 / L2.
Good call on linking Drepper's paper, it's great (but the software-prefetch recommendations are getting a bit out of date, because they were written for P4). Later CPUs have smarter HW prefetchers, so as I understand it, software prefetch is mostly useful for less-predictable patterns (e.g. a binary search can prefetch both possibilities for the next iteration, at +1/4 or +3/4). Anyway, I added it to the x86 tag wiki a couple of months ago.

The physical memory of your computer is not large enough to contain the whole array. One solution to your problem would be to add more memory.

Your OS works with virtual memory. Pages of virtual memory are moved to disk whenever more physical memory is needed. Accessing the disk is time consuming, and that's what kills the performance. In your worst-case scenario the OS keeps writing and reading (or just reading) pages all the time.

So, another solution would be to reorganize your data in such a way that disk access is about the same regardless of the direction in which the pixels are scanned. I propose storing the data in 3D blocks the size of a page (usually 4 KB; for one-byte pixels that is a 16×16×16 cube). When you scan in one direction you touch only a few of these blocks, and when you scan in another direction you touch about the same number of blocks, just a different set. With a little luck (it depends on available physical memory) no pages will needlessly move in and out of the swap file.
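A sketch of how the index into such a bricked layout could be computed (the brick size, the helper name, and the assumption that the dimensions are multiples of the brick edge are all illustrative):

// Voxels are stored in 16x16x16 bricks laid out one after another, so a scan
// along any axis touches a bounded number of bricks per row.
#include <cstddef>

constexpr std::size_t B = 16;  // brick edge length in voxels

// Bricked linear index for a volume whose dimensions are multiples of B.
inline std::size_t brickedIndex(std::size_t x, std::size_t y, std::size_t z,
                                std::size_t xMax, std::size_t yMax) {
    const std::size_t bricksX = xMax / B, bricksY = yMax / B;
    const std::size_t bx = x / B, by = y / B, bz = z / B;   // which brick
    const std::size_t ix = x % B, iy = y % B, iz = z % B;   // position inside it
    const std::size_t brick = (bz * bricksY + by) * bricksX + bx;
    return brick * (B * B * B) + (iz * B + iy) * B + ix;
}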

The best and simplest solution is to scan the pixels in only one direction. Maybe you don't really need the ability to scan the pixels sideways.

1 Comment

The OP gave examples with tiny sizes. This is a caching issue, not a VM paging issue. The problem is essentially the same, though: only making use of a small amount of the data in each cache-line or page as it's brought in from main mem or disk.
