
Is there a general strategy for creating an efficient bit permutation algorithm? The goal is a fast, branchless and, if possible, LUT-less algorithm. I'll give an example:

A 13-bit code is to be transformed into another 13-bit code according to the following rule table:

BIT  INPUT (DEC)  INPUT (BIN)    OUTPUT (BIN)   OUTPUT (DEC)
  0            1  0000000000001  0000100000000           256
  1            2  0000000000010  0010000000000          1024
  2            4  0000000000100  0100000000000          2048
  3            8  0000000001000  1000000000000          4096
  4           16  0000000010000  0000001000000            64
  5           32  0000000100000  0000000100000            32
  6           64  0000001000000  0001000000000           512
  7          128  0000010000000  0000000010000            16
  8          256  0000100000000  0000000001000             8
  9          512  0001000000000  0000000000010             2
 10         1024  0010000000000  0000000000001             1
 11         2048  0100000000000  0000000000100             4
 12         4096  1000000000000  0000010000000           128

Example: If the input code is 1+2+4096=4099 the resulting output would be 256+1024+128=1408

A naive approach would be:

OUTPUT = ((INPUT AND 0000000000001) << 8) OR ((INPUT AND 0000000000010) << 9) OR ((INPUT AND 0000000000100) << 9) OR ((INPUT AND 0000000001000) << 9) OR ...

That means three instructions per bit (AND, SHIFT, OR), i.e. 13 × 3 − 1 = 38 instructions for the above example (the last OR is omitted). Instead we could also use a combination of left and right shifts to potentially reduce code size (depending on the target platform), but this will not decrease the number of instructions.
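Spelled out in C (with a function name of my own choosing, and using right shifts for the bits that move toward position 0), the naive per-bit form would look roughly like this:

```c
#include <stdint.h>

/* Direct transcription of the rule table: one AND, one SHIFT and one OR
   per input bit. Bit 5 keeps its position, so its shift disappears. */
static uint16_t permute_naive(uint16_t x) {
    return (uint16_t)( ((x & (1u << 0))  << 8)    /* bit 0  -> 8  */
                     | ((x & (1u << 1))  << 9)    /* bit 1  -> 10 */
                     | ((x & (1u << 2))  << 9)    /* bit 2  -> 11 */
                     | ((x & (1u << 3))  << 9)    /* bit 3  -> 12 */
                     | ((x & (1u << 4))  << 2)    /* bit 4  -> 6  */
                     |  (x & (1u << 5))           /* bit 5  -> 5  */
                     | ((x & (1u << 6))  << 3)    /* bit 6  -> 9  */
                     | ((x & (1u << 7))  >> 3)    /* bit 7  -> 4  */
                     | ((x & (1u << 8))  >> 5)    /* bit 8  -> 3  */
                     | ((x & (1u << 9))  >> 8)    /* bit 9  -> 1  */
                     | ((x & (1u << 10)) >> 10)   /* bit 10 -> 0  */
                     | ((x & (1u << 11)) >> 9)    /* bit 11 -> 2  */
                     | ((x & (1u << 12)) >> 5) ); /* bit 12 -> 7  */
}
```

For the example input 1+2+4096 = 4099 this yields 256+1024+128 = 1408, matching the table.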

When inspecting the example table, you will of course notice a few obvious optimization opportunities, for example the rows for bits 1/2/3, which can be combined as ((INPUT AND 0000000001110) << 9). But beyond that it becomes a difficult, tedious task.

Are there general strategies? I think using Karnaugh–Veitch maps to simplify the expression could be one approach, but that is pretty difficult for 13 input variables. Also, the resulting expression would only be a combination of ORs and ANDs.

  • It looks a bit like: Optimize lookup tables to simple ALU Commented Jun 20, 2022 at 10:48
  • Am I correct that the output is a fixed permutation of bits of the input? Commented Jun 20, 2022 at 12:08
  • @RafałDowgird Yes, you are right, each bit in the input has a fixed bit in the output. Maybe a better title would be to ask for a bit permutation algorithm strategy? Commented Jun 20, 2022 at 12:12
  • You can balance both. First 10 bits from LUT (that is 4kB) and remaining 3 bits from computation (9 operations). 100%-cache-hit on L1 query and 9 bitwise operations should be doable in instruction-level-parallelism. Commented Jun 20, 2022 at 12:18
  • What exactly is the input/output here? Do I understand correctly the input is a translation table (like you have provided, but with redundancy removed) and the output should be an expression in C-style syntax? A string? A tree? Commented Jun 20, 2022 at 12:22

3 Answers


For bit permutations, several strategies are known that work in certain cases. There is a code generator at https://programming.sirrida.de/calcperm.php which implements most of them. In this case, however, it finds essentially only the strategy you already suggested, which indicates that this permutation has little structure to exploit.
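For reference, that strategy — grouping source bits by their common shift distance so each group costs one AND, one shift and one OR — collapses the thirteen per-bit terms of this table into ten masked terms; a sketch:

```c
#include <stdint.h>

/* Bits grouped by shift distance, taken from the question's rule table:
   bits 1-3 all move left by 9 and bits 8/12 both move right by 5, so only
   ten masked terms remain (bit 5 stays in place and needs no shift). */
static uint16_t permute_grouped(uint16_t x) {
    return (uint16_t)( ((x & 0x0001u) << 8)     /* bit 0     -> 8     */
                     | ((x & 0x000Eu) << 9)     /* bits 1-3  -> 10-12 */
                     | ((x & 0x0010u) << 2)     /* bit 4     -> 6     */
                     |  (x & 0x0020u)           /* bit 5     -> 5     */
                     | ((x & 0x0040u) << 3)     /* bit 6     -> 9     */
                     | ((x & 0x0080u) >> 3)     /* bit 7     -> 4     */
                     | ((x & 0x1100u) >> 5)     /* bits 8/12 -> 3/7   */
                     | ((x & 0x0200u) >> 8)     /* bit 9     -> 1     */
                     | ((x & 0x0400u) >> 10)    /* bit 10    -> 0     */
                     | ((x & 0x0800u) >> 9) );  /* bit 11    -> 2     */
}
```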


1 Comment

This was the information I was looking for. There is plenty of explanation there about the various strategies.

If one big lookup table is too much, you can try to use two smaller ones.

  1. Take 7 lower bits of the input, look up a 16-bit value in table A.

  2. Take 6 higher bits of the input, look up a 16-bit value in table B.

  3. OR the values from steps 1 and 2 to produce the result.

Table A needs 128*2 bytes, table B needs 64*2 bytes, that's 384 bytes for the lookup tables.
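A minimal sketch of this two-table scheme (the `dst[]` array and the function names are my own; `dst[]` just lists the output position of each input bit, read off the question's rule table):

```c
#include <stdint.h>

/* Output bit position for each of the 13 input bits, per the rule table. */
static const uint8_t dst[13] = { 8, 10, 11, 12, 6, 5, 9, 4, 3, 1, 0, 2, 7 };

static uint16_t tableA[128];  /* permutation of input bits 0-6  */
static uint16_t tableB[64];   /* permutation of input bits 7-12 */

/* Fill both tables once at startup. */
static void init_tables(void) {
    for (int i = 0; i < 128; i++)
        for (int b = 0; b < 7; b++)
            if (i & (1 << b)) tableA[i] |= (uint16_t)(1u << dst[b]);
    for (int i = 0; i < 64; i++)
        for (int b = 0; b < 6; b++)
            if (i & (1 << b)) tableB[i] |= (uint16_t)(1u << dst[b + 7]);
}

/* Low 7 bits index table A, high 6 bits index table B; OR the halves. */
static uint16_t permute(uint16_t x) {
    return (uint16_t)(tableA[x & 0x7F] | tableB[x >> 7]);
}
```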



This is a hand-optimised multiple LUT solution, which doesn't really prove anything other than that I had some time to burn.

Multiple small lookup tables can occasionally save time and/or space, but I don't know of a strategy for finding the optimal combination. In this case, the best division seems to be three LUTs of three bits each (bits 4-6, 7-9 and 10-12), totalling 24 bytes (each table has 8 one-byte entries), plus a simple shift to cover bits 1 through 3, and another simple shift for the remaining bit 0. Bit 5, which is untransformed, was also a tempting target, but I don't see a good way to divide the bit ranges around it.

The three look-up tables have single-byte entries because the range of the transformations for each range is just one byte. In fact, the transformations for two of the bit ranges fit entirely in the low-order byte, avoiding a shift.

Here's the code:

unsigned short permute_bits(unsigned short x) {
#define LUT3(BIT0, BIT1, BIT2)                                \
{ 0,      (BIT0),        (BIT1),        (BIT1)+(BIT0),        \
  (BIT2), (BIT2)+(BIT0), (BIT2)+(BIT1), (BIT2)+(BIT1)+(BIT0)}
    static const unsigned char t4[] =  LUT3(1<<(6-3), 1<<(5-3), 1<<(9-3));
    static const unsigned char t7[] =  LUT3(1<<4, 1<<3, 1<<1);
    static const unsigned char t10[] = LUT3(1<<0, 1<<2, 1<<7);
#undef LUT3
    return   (   (x&1)       << 8)    // Bit 0
           + (   (x&14)      << 9)    // Bits 1-3, simple shift
           + (t4[(x>>4)&7]   << 3)    // Bits 4-6, see below
           + (t7[(x>>7)&7]       )    // Bits 7-9, three-bit lookup for LOB
           + (t10[(x>>10)&7]     );   // Bits 10-12, ditto
}

Note on bits 4-6

Bit 6 is transformed to position 9, which is outside of the low-order byte. However, bits 4 and 5 are moved to positions 6 and 5, respectively, and the total range of the transformed bits is only 5 bit positions. Several different final shifts are possible, but keeping the shift relatively small provides a tiny improvement on the x86 architecture, because it allows the use of LEA to do a simultaneous shift and add. (See the second-to-last instruction in the assembly below.)

The intermediate results are added instead of using boolean OR for the same reason. Since the sets of bits in each intermediate result are disjoint, ADD and OR have the same result; using add can take advantage of chip features like LEA.

Here's the compilation of that function, taken from http://gcc.godbolt.org using gcc 12.1 with -O3:

permute_bits(unsigned short):
        mov     edx, edi
        mov     ecx, edi
        movzx   eax, di
        shr     di, 4
        shr     dx, 7
        shr     cx, 10
        and     edi, 7
        and     edx, 7
        and     ecx, 7
        movzx   ecx, BYTE PTR permute_bits(unsigned short)::t10[rcx]
        movzx   edx, BYTE PTR permute_bits(unsigned short)::t7[rdx]
        add     edx, ecx
        mov     ecx, eax
        sal     eax, 9
        sal     ecx, 8
        and     ax, 7168
        and     cx, 256
        or      eax, ecx
        add     edx, eax
        movzx   eax, BYTE PTR permute_bits(unsigned short)::t4[rdi]
        lea     eax, [rdx+rax*8]
        ret

I left out the lookup tables themselves because the assembly produced by GCC isn't very helpful.

I don't know if this is any quicker than @trincot's solution (in a comment); a quick benchmark was inconclusive, but it looked to be a few percent quicker. But it's quite a bit shorter, possibly enough to compensate for the 24 bytes of lookup data.

1 Comment

This seems to be a pretty good implementation even on my PIC32 MCU. If I counted right there were only 19 instructions.
