Efficiently compute the mean bits of many bitstrings in python / numpy?

Question

I have a few thousand bitstrings stored as longs. Each bitstring is 1024 bits. I would like to create an array of the ratios each bit is 1.

For example (pseudocode):

bs = [
    1 0 0 0,
    0 1 1 0,
    1 1 0 0,
    0 0 0 0
]
ratios(bs) => [0.5, 0.5, 0.25 0.0]

My current slow code is:

def mean_signature(bitstrings, bit_count):
    means = []
    for b in range(bit_count):
        m = sum((x >> b) & 1 for x in bitstrings) / len(bitstrings)
        means.append(m)
    return means

I am about to modify the code so the outer loop is over bitstrings, but think I must be missing something. Maybe using numpy bit arrays.

No, the input is a list of longs. As far as I can tell it's not trivial to convert this to a numpy array - unless you use dtype=object (reference). — AnnanFay
– AnnanFay, Commented Apr 18, 2017 at 12:47

Warren Weckesser · Accepted Answer · 2017-04-18 16:18:00Z

2

Here's one way you can do it, but it is probably not the most efficient method possible.

For the demonstration, I'll use 8 bit integers, but it will also work with your 1024 bit integers.

In [28]: bs = [0b11110000, 0b11111100, 0b11000000, 0b11111110, 0b00001100]

In [29]: bs
Out[29]: [240, 252, 192, 254, 12]

In [30]: nbits = 8

In [31]: bits = np.array([list(np.binary_repr(b, width=nbits)) for b in bs], dtype=np.uint8)

In [32]: bits
Out[32]: 
array([[1, 1, 1, 1, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 0],
       [0, 0, 0, 0, 1, 1, 0, 0]], dtype=uint8)

bits is the array containing the bits of the binary representation of each value. The ratio that you want is the mean of the columns:

In [33]: bits.mean(axis=0)
Out[33]: array([ 0.8,  0.8,  0.6,  0.6,  0.6,  0.6,  0.2,  0. ])

The order of those values is from the highest order bit to the lowest. It might be more natural for the index of the result to match the usual bit indices. For that, just reverse the result:

In [34]: bits.mean(axis=0)[::-1]
Out[34]: array([ 0. ,  0.2,  0.6,  0.6,  0.6,  0.6,  0.8,  0.8])

edited Apr 18, 2017 at 16:18

answered Apr 18, 2017 at 14:30

Warren Weckesser

116k20 gold badges207 silver badges224 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Leon Over a year ago

I think the result must be reversed to correctly match the bit indices

Eric · Accepted Answer · 2017-04-18 14:57:24Z

The hardest step here is taking an object array of longs, and storing them in a numpy container suitable for vectorization. The following function breaks up a long (array) into little-endian words:

def long_to_multi_word(l, dtype=np.uint64, nwords=None):
    dtype = np.dtype(dtype)
    l = np.asarray(l, object)

    nbits = 8 * dtype.itemsize

    if nwords is None:
        lmax = l.max()
        nwords = 0
        while lmax != 0:
            lmax >>= nbits
            nwords += 1

    arr = np.zeros(l.shape + (nwords,), dtype)

    mask = (1 << nbits) - 1

    for i in range(0, nwords):
        arr[...,i] = l & mask
        l = l >> nbits

    return arr

Giving:

>>> data = [1, 2, 3, 2**128 + 2**64 + 42]   # one of these is too big to fit in a uint64
>>> data_words = long_to_multi_word(data)
>>> data_words
array([[ 1,  0,  0],
       [ 2,  0,  0],
       [ 3,  0,  0],
       [42,  1,  1]], dtype=uint64)

Now the approach is just to use np.unpackbits:

# could have used long_to_multi_word(data, np.uint8), but would be slower
data_bytes = data_words.view(np.uint8)

data_bits = np.unpackbits(data_bytes, axis=-1)
n_bits = data_bits.sum(axis=0)

Which can be trivially converted into an average

Collectives™ on Stack Overflow

Efficiently compute the mean bits of many bitstrings in python / numpy?

2 Answers 2

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related