
Is there a general strategy for creating an efficient bit permutation algorithm? The goal is a fast, branchless and, if possible, LUT-less algorithm. I'll give an example:

A 13-bit code is to be transformed into another 13-bit code according to the following rule table:

BIT  INPUT (DEC)  INPUT (BIN)    OUTPUT (BIN)   OUTPUT (DEC)
  0            1  0000000000001  0000100000000           256
  1            2  0000000000010  0010000000000          1024
  2            4  0000000000100  0100000000000          2048
  3            8  0000000001000  1000000000000          4096
  4           16  0000000010000  0000001000000            64
  5           32  0000000100000  0000000100000            32
  6           64  0000001000000  0001000000000           512
  7          128  0000010000000  0000000010000            16
  8          256  0000100000000  0000000001000             8
  9          512  0001000000000  0000000000010             2
 10         1024  0010000000000  0000000000001             1
 11         2048  0100000000000  0000000000100             4
 12         4096  1000000000000  0000010000000           128

Example: If the input code is 1+2+4096=4099 the resulting output would be 256+1024+128=1408

A naive approach would be:

OUTPUT = ((INPUT AND 0000000000001) << 8) OR ((INPUT AND 0000000000010) << 9) OR ((INPUT AND 0000000000100) << 9) OR ((INPUT AND 0000000001000) << 9) OR ...

That means three instructions per bit (AND, SHIFT, OR), i.e. 13 × 3 − 1 = 38 instructions for the above example (the last OR is omitted). Instead we could also use a combination of left and right shifts to potentially reduce code size (depending on the target platform), but this will not decrease the number of instructions.
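Spelled out in C (with a function name of my own choosing, and using right shifts for the bits that move toward position 0), the naive per-bit form would look roughly like this:

```c
#include <stdint.h>

/* Direct transcription of the rule table: one AND, one SHIFT and one OR
   per input bit. Bit 5 keeps its position, so its shift disappears. */
static uint16_t permute_naive(uint16_t x) {
    return (uint16_t)( ((x & (1u << 0))  << 8)    /* bit 0  -> 8  */
                     | ((x & (1u << 1))  << 9)    /* bit 1  -> 10 */
                     | ((x & (1u << 2))  << 9)    /* bit 2  -> 11 */
                     | ((x & (1u << 3))  << 9)    /* bit 3  -> 12 */
                     | ((x & (1u << 4))  << 2)    /* bit 4  -> 6  */
                     |  (x & (1u << 5))           /* bit 5  -> 5  */
                     | ((x & (1u << 6))  << 3)    /* bit 6  -> 9  */
                     | ((x & (1u << 7))  >> 3)    /* bit 7  -> 4  */
                     | ((x & (1u << 8))  >> 5)    /* bit 8  -> 3  */
                     | ((x & (1u << 9))  >> 8)    /* bit 9  -> 1  */
                     | ((x & (1u << 10)) >> 10)   /* bit 10 -> 0  */
                     | ((x & (1u << 11)) >> 9)    /* bit 11 -> 2  */
                     | ((x & (1u << 12)) >> 5) ); /* bit 12 -> 7  */
}
```

For the example input 1+2+4096 = 4099 this yields 256+1024+128 = 1408, matching the table.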

When inspecting the example table, you will of course notice a few obvious optimization opportunities, for example the rows for bits 1/2/3, which can be combined as ((INPUT AND 0000000001110) << 9). But beyond that it becomes a difficult, tedious task.

Are there general strategies? I think using Karnaugh–Veitch maps to simplify the expression could be one approach, but that is pretty difficult for 13 input variables. Also, the resulting expression would only be a combination of ORs and ANDs.

  • It looks a bit like: Optimize lookup tables to simple ALU Commented Jun 20, 2022 at 10:48
  • Am I correct that the output is a fixed permutation of bits of the input? Commented Jun 20, 2022 at 12:08
  • @RafałDowgird Yes, you are right, each bit in the input has a fixed bit in the output. Maybe a better title would be to ask for a bit permutation algorithm strategy? Commented Jun 20, 2022 at 12:12
  • You can balance both. First 10 bits from LUT (that is 4kB) and remaining 3 bits from computation (9 operations). 100%-cache-hit on L1 query and 9 bitwise operations should be doable in instruction-level-parallelism. Commented Jun 20, 2022 at 12:18
  • What exactly is the input/output here? Do I understand correctly the input is a translation table (like you have provided, but with redundancy removed) and the output should be an expression in C-style syntax? A string? A tree? Commented Jun 20, 2022 at 12:22

3 Answers


For bit permutations, several strategies are known that work in certain cases. There is a code generator at https://programming.sirrida.de/calcperm.php which implements most of them. In this case, however, it finds essentially only the strategy you already suggested, which indicates that this permutation has little structure to exploit.
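For reference, that strategy — grouping source bits by their common shift distance so each group costs one AND, one shift and one OR — collapses the thirteen per-bit terms of this table into ten masked terms; a sketch:

```c
#include <stdint.h>

/* Bits grouped by shift distance, taken from the question's rule table:
   bits 1-3 all move left by 9 and bits 8/12 both move right by 5, so only
   ten masked terms remain (bit 5 stays in place and needs no shift). */
static uint16_t permute_grouped(uint16_t x) {
    return (uint16_t)( ((x & 0x0001u) << 8)     /* bit 0     -> 8     */
                     | ((x & 0x000Eu) << 9)     /* bits 1-3  -> 10-12 */
                     | ((x & 0x0010u) << 2)     /* bit 4     -> 6     */
                     |  (x & 0x0020u)           /* bit 5     -> 5     */
                     | ((x & 0x0040u) << 3)     /* bit 6     -> 9     */
                     | ((x & 0x0080u) >> 3)     /* bit 7     -> 4     */
                     | ((x & 0x1100u) >> 5)     /* bits 8/12 -> 3/7   */
                     | ((x & 0x0200u) >> 8)     /* bit 9     -> 1     */
                     | ((x & 0x0400u) >> 10)    /* bit 10    -> 0     */
                     | ((x & 0x0800u) >> 9) );  /* bit 11    -> 2     */
}
```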


1 Comment

This was the information I was looking for. There is plenty of explanation there about the various strategies.

If one big lookup table is too much, you can try to use two smaller ones.

  1. Take 7 lower bits of the input, look up a 16-bit value in table A.

  2. Take 6 higher bits of the input, look up a 16-bit value in table B.

  3. OR the values from steps 1 and 2 to produce the result.

Table A needs 128*2 bytes, table B needs 64*2 bytes, that's 384 bytes for the lookup tables.
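A minimal sketch of this two-table scheme (the `dst[]` array and the function names are my own; `dst[]` just lists the output position of each input bit, read off the question's rule table):

```c
#include <stdint.h>

/* Output bit position for each of the 13 input bits, per the rule table. */
static const uint8_t dst[13] = { 8, 10, 11, 12, 6, 5, 9, 4, 3, 1, 0, 2, 7 };

static uint16_t tableA[128];  /* permutation of input bits 0-6  */
static uint16_t tableB[64];   /* permutation of input bits 7-12 */

/* Fill both tables once at startup. */
static void init_tables(void) {
    for (int i = 0; i < 128; i++)
        for (int b = 0; b < 7; b++)
            if (i & (1 << b)) tableA[i] |= (uint16_t)(1u << dst[b]);
    for (int i = 0; i < 64; i++)
        for (int b = 0; b < 6; b++)
            if (i & (1 << b)) tableB[i] |= (uint16_t)(1u << dst[b + 7]);
}

/* Low 7 bits index table A, high 6 bits index table B; OR the halves. */
static uint16_t permute(uint16_t x) {
    return (uint16_t)(tableA[x & 0x7F] | tableB[x >> 7]);
}
```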



This is a hand-optimised multiple LUT solution, which doesn't really prove anything other than that I had some time to burn.

Multiple small lookup tables can occasionally save time and/or space, but I don't know of a strategy for finding the optimal combination. In this case, the best division seems to be three LUTs of three bits each (bits 4-6, 7-9 and 10-12), totalling 24 bytes (each table has 8 one-byte entries), plus a simple shift to cover bits 1 through 3, and another simple shift for the remaining bit 0. Bit 5, which is untransformed, was also a tempting target, but I don't see a good way to divide the bit ranges around it.

The three look-up tables have single-byte entries because the range of the transformations for each range is just one byte. In fact, the transformations for two of the bit ranges fit entirely in the low-order byte, avoiding a shift.

Here's the code:

unsigned short permute_bits(unsigned short x) {
#define LUT3(BIT0, BIT1, BIT2)                                \
{ 0,      (BIT0),        (BIT1),        (BIT1)+(BIT0),        \
  (BIT2), (BIT2)+(BIT0), (BIT2)+(BIT1), (BIT2)+(BIT1)+(BIT0)}
    static const unsigned char t4[] =  LUT3(1<<(6-3), 1<<(5-3), 1<<(9-3));
    static const unsigned char t7[] =  LUT3(1<<4, 1<<3, 1<<1);
    static const unsigned char t10[] = LUT3(1<<0, 1<<2, 1<<7);
#undef LUT3
    return   (   (x&1)       << 8)    // Bit 0
           + (   (x&14)      << 9)    // Bits 1-3, simple shift
           + (t4[(x>>4)&7]   << 3)    // Bits 4-6, see below
           + (t7[(x>>7)&7]       )    // Bits 7-9, three-bit lookup for LOB
           + (t10[(x>>10)&7]     );   // Bits 10-12, ditto
}

Note on bits 4-6

Bit 6 is transformed to position 9, which is outside of the low-order byte. However, bits 4 and 5 are moved to positions 6 and 5, respectively, and the total range of the transformed bits is only 5 bit positions. Several different final shifts are possible, but keeping the shift relatively small provides a tiny improvement on the x86 architecture, because it allows the use of LEA to do a simultaneous shift and add. (See the second-to-last instruction in the assembly below.)

The intermediate results are added instead of using boolean OR for the same reason. Since the sets of bits in each intermediate result are disjoint, ADD and OR have the same result; using add can take advantage of chip features like LEA.

Here's the compilation of that function, taken from http://gcc.godbolt.org using gcc 12.1 with -O3:

permute_bits(unsigned short):
        mov     edx, edi
        mov     ecx, edi
        movzx   eax, di
        shr     di, 4
        shr     dx, 7
        shr     cx, 10
        and     edi, 7
        and     edx, 7
        and     ecx, 7
        movzx   ecx, BYTE PTR permute_bits(unsigned short)::t10[rcx]
        movzx   edx, BYTE PTR permute_bits(unsigned short)::t7[rdx]
        add     edx, ecx
        mov     ecx, eax
        sal     eax, 9
        sal     ecx, 8
        and     ax, 7168
        and     cx, 256
        or      eax, ecx
        add     edx, eax
        movzx   eax, BYTE PTR permute_bits(unsigned short)::t4[rdi]
        lea     eax, [rdx+rax*8]
        ret

I left out the lookup tables themselves because the assembly produced by GCC isn't very helpful.

I don't know if this is any quicker than @trincot's solution (in a comment); a quick benchmark was inconclusive, but it looked to be a few percent quicker. But it's quite a bit shorter, possibly enough to compensate for the 24 bytes of lookup data.

1 Comment

This seems to be a pretty good implementation even on my PIC32 MCU. If I counted right there were only 19 instructions.
