
Input is a bitarray stored in contiguous memory with 1 bit of the bitarray per 1 bit of memory.

Output is an array of the indices of set bits of the bitarray.

Example:

bitarray: 0000 1111 0101 1010
setA: {4,5,6,7,9,11,12,14}
setB: {2,4,5,7,9,10,11,12}

Getting either set A or set B is fine. The set is stored as an array of uint32_t, so each element of the set is an unsigned 32-bit integer in the array.

How can I do this about 5 times faster on a single CPU core?

current code:

#include <iostream>
#include <vector>
#include <cstdint>
#include <time.h>

using namespace std;

// forward declaration: bitarray2set calls this before its definition below
inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base);

template <typename T>
uint32_t bitarray2set(T& v, uint32_t * ptr_set){
    uint32_t i;
    uint32_t base = 0;
    uint32_t * ptr_set_new = ptr_set;
    uint32_t size = v.capacity();
    for(i = 0; i < size; i++){
        find_set_bit(v[i], ptr_set_new, base);
        base += 8*sizeof(uint32_t);
    }
    return (ptr_set_new - ptr_set);
}

inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base){
    // Find the set bits in a uint32_t
    int k = base;
    while(n){
        if (n & 1){
            *(ptr_set) = k;
            ptr_set++;
        }
        n = n >> 1;
        k++;
    }
}

template <typename T>
void rand_vector(T& v){
    srand(time(NULL));
    int i;
    int size = v.capacity();
    for (i=0;i<size;i++){
        v[i] = rand();
    }
}

template <typename T>
void print_vector(T& v, int size_in = 0){
    int i;

    int size;
    if (size_in == 0){
        size = v.capacity();
    } else {
        size = size_in;
    }
    for (i=0;i<size;i++){
        cout << v[i] << ' ';
    }
    cout << endl;
}

int main(void){
    const int test_size = 6000;
    vector<uint32_t> vec(test_size);
    vector<uint32_t> set(test_size*sizeof(uint32_t)*8);
    rand_vector(vec);
    //for (int i = 0; i < 64; i++) vec[i] = -1;
    //cout << "input" << endl;
    print_vector(vec);
    //cout << "calculate result" << endl;

    int i;
    int rep = 10000;
    uint32_t res_size;

    struct timespec tp_start, tp_end;
    clock_gettime(CLOCK_MONOTONIC, &tp_start);
    for (i=0;i<rep;i++){
        res_size = bitarray2set(vec, set.data());
    }
    clock_gettime(CLOCK_MONOTONIC, &tp_end);
    double timing;
    const double nano = 0.000000001;

    timing = ((double)(tp_end.tv_sec  - tp_start.tv_sec )
           + (tp_end.tv_nsec - tp_start.tv_nsec) * nano) /(rep);

    cout << "timing per cycle: " << timing << endl;
    cout << "print result" << endl;
    //print_vector(set, res_size);
}

result (compiled with icc -O3 code.cpp -lrt)

...
timing per cycle: 0.000739613 (7.4E-4).
print result

0.0008 seconds to convert 768,000 bits to a set. But there are at least 10,000 arrays of 768,000 bits in each cycle of the real workload, which comes to 8 seconds per cycle. That is too slow.

The cpu has popcnt instruction and sse4.2 instruction set.

Thanks.

Update


template <typename T>
uint32_t bitarray2set(T& v, uint32_t * ptr_set){
    uint32_t i;
    uint32_t base = 0;
    uint32_t * ptr_set_new = ptr_set;
    uint32_t size = v.capacity();
    uint32_t * ptr_v;
    uint32_t * ptr_v_end = v.data() + size;
    for(ptr_v = v.data(); ptr_v < ptr_v_end; ++ptr_v){
        uint32_t n = *ptr_v;  // work on a copy so the input bitarray is not clobbered
        while(n) {
           *ptr_set_new++ = base + __builtin_ctz(n);
           n &= n - 1;  // zeros the lowest 1-bit in n
        }
        base += 8*sizeof(uint32_t);
    }
    return (ptr_set_new - ptr_set);
}

This updated version uses the inner loop provided by rhashimoto. I don't know whether inlining actually made the function slower (I never thought that could happen!). The new timing is 1.14E-5 (compiled by icc -O3 code.cpp -lrt, and benchmarked on a random vector).

Warning:

I just found that reserving (instead of resizing) a std::vector and then writing directly to the vector's data through a raw pointer is a bad idea. Resizing first and then using a raw pointer is fine, though. See Robᵩ's answer at "Resizing a C++ std::vector<char> without initializing data". I am going to just use resize instead of reserve and stop worrying about the time that resize wastes by calling the constructor of each element of the vector... at least vectors actually use contiguous memory, like a plain array ("Are std::vector elements guaranteed to be contiguous?").

9 Comments
  • Would you be able to trade off space for time? Table lookup could do it. Consider 8 tables - one for each 4-bit field of the 32-bit word - of 16 entries each, where each entry is an array (or vector) of the indices for that 4-bit pattern of the uint32. Commented Jul 12, 2016 at 22:05
  • I think so. That is a good idea. How would I use that table? Do you mean to do a bitwise & between the 32-bit word from the input and the key of the tables? Commented Jul 12, 2016 at 22:10
  • for each of 8 4-bit fields: grab the field, use it as an index into an 8-element array of a 16-element array of variable length vectors, append all items in that vector to your set of indexes being accumulated. Or you could use 1 256-entry vector for all 4 bytes, but you'd have to add 0/8/16/24 to the indexes as you added them to your result vector. Given your hint that you can use SSE4 I'm hoping someone answers with a cool SIMD approach. Why not add [sse] as a tag? Commented Jul 12, 2016 at 22:16
  • @davidbak: I don't think SSE4 helps (just SSE2 for vector integer add and shift), but popcnt does to solve the variable-length store problem. Commented Jul 14, 2016 at 6:03
  • re:resizing: reserve space and use emplace_back to construct new elements in-place as they are created. Commented Jul 14, 2016 at 18:59

2 Answers


I notice that you use .capacity() when you probably mean to use .size(). That could make you do extra unnecessary work, as well as giving you the wrong answer.

Your loop in find_set_bit() iterates over all 32 bits in the word. You can instead iterate only over the set bits and use the BSF instruction to find the index of the lowest one. GCC has an intrinsic, __builtin_ctz(), that generates BSF or its equivalent - I think the Intel compiler also supports it (you can fall back to inline assembly if not). The modified function would look like this:

inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base){
    // Find the set bits in a uint32_t
    while(n) {
       *ptr_set++ = base + __builtin_ctz(n);
       n &= n - 1;  // zeros the lowest 1-bit in n
    }
}

On my Linux machine, compiling with g++ -O3, replacing that function drops the reported time from 0.000531434 to 0.000101352.

There are quite a few ways to find a bit index in the answers to this question. I do think that __builtin_ctz() is going to be the best choice for you, though. I don't believe that there is a reasonable SIMD approach to your problem, as each input word produces a variable amount of output.


3 Comments

Thank you for this answer. It helped a lot. The program becomes 3x faster on the machine, after using your inner loop. I also found the outer loop is kind of slow, and improved that in the updated version.
Variable-size output per chunk isn't a problem for SIMD, with the help of popcnt. See my answer.
I just found that reserving instead of resizing a std::vector and then writing directly to the vector's data through a raw pointer is a bad idea. Resizing first and then using a raw pointer is fine, though. See Robᵩ's answer at stackoverflow.com/questions/7689406/… I am going to just use resize instead of reserve and stop worrying about the time that resize wastes by calling the constructor of each member.

As suggested by @davidbak, you could use a table lookup to process the bitmap 4 bits (one nibble) at a time.

Each lookup produces a variable-sized chunk of set members, which we can handle by using popcnt.

@rhashimoto's scalar ctz-based suggestion will probably do better with sparse bitsets that have lots of zeros, but this should be better when there are a lot of set bits.

I'm thinking something like

// a vector of 4 elements for every pattern of 4 bits.
// values range from 0 to 3, and will have a multiple of 4 added to them.
alignas(16) static const int LUT[16*4] = { 0,0,0,0,  ... };

// mostly C, some pseudocode.
unsigned int bitmap2set(int *set, int input) {
    int *set_start = set;

    __m128i offset = _mm_setzero_si128();

    for (nibble in input[]) {  // pseudocode for the actual shifting / masking
        __m128i v = _mm_load_si128((const __m128i*)&LUT[nibble * 4]);
        __m128i vpos = _mm_add_epi32(v, offset);

        _mm_storeu_si128((__m128i*)set, vpos);

        set += _mm_popcnt_u32(nibble);    // variable-length store
        offset = _mm_add_epi32(offset, _mm_set1_epi32(4));  // increment the offset by 4
    }
    return  set - set_start;  // set size
}

When a nibble isn't 1111, the next store will overlap, but that's fine.

Using popcnt to figure out how much to increment a pointer is a useful technique in general for left-packing variable-length data into a destination array.

1 Comment

Thank you for this answer. I will try this one after finishing writing other parts of the program.
