You know, arranging the data in memory so that all the calculations could be done in one (very optimized, SSE, etc.) loop would help. HOWEVER, take into account that you're accessing a lot of memory doing a very, very fast operation, so the optimization won't be much. AND, if you choose to rearrange the memory, the arranging process it would be maybe slower than the calculation itself.
Looking at this problem, it comes to my mind an article by Charles Petzold on the book "Beautiful Code". You could generate code patterns for each value of each line of the loop (100 different code patterns) that only generate an assignation if the corresponding bit value is 1, and then "junp" to the correct implementation depending on the bit value of the line you're processing. You would need to use bitfields for the different masks. You convert a 3 nested loop into a 2 nested loop with optimized code for the inner loop (not too bad), having to generate using some other utility (or just plain C/C++) the code itself for the different values of the inner loop. You should read the chapter to understand it. Really neat.
bool arr[][][], I'd use a bitmask to store the data. Operating on a bitmask would be far more natural and almost certainly more efficient.