4

I have a few thousand bitstrings stored as longs. Each bitstring is 1024 bits. I would like to create an array of the ratios each bit is 1.

For example (pseudocode):

bs = [
    1 0 0 0,
    0 1 1 0,
    1 1 0 0,
    0 0 0 0
]
ratios(bs) => [0.5, 0.5, 0.25 0.0]

My current slow code is:

def mean_signature(bitstrings, bit_count):
    means = []
    for b in range(bit_count):
        m = sum((x >> b) & 1 for x in bitstrings) / len(bitstrings)
        means.append(m)
    return means

I am about to modify the code so the outer loop is over bitstrings, but think I must be missing something. Maybe using numpy bit arrays.

2
  • So, the input is array of 1s and 0s`? Commented Apr 18, 2017 at 12:05
  • No, the input is a list of longs. As far as I can tell it's not trivial to convert this to a numpy array - unless you use dtype=object (reference). Commented Apr 18, 2017 at 12:47

2 Answers 2

2

Here's one way you can do it, but it is probably not the most efficient method possible.

For the demonstration, I'll use 8 bit integers, but it will also work with your 1024 bit integers.

In [28]: bs = [0b11110000, 0b11111100, 0b11000000, 0b11111110, 0b00001100]

In [29]: bs
Out[29]: [240, 252, 192, 254, 12]

In [30]: nbits = 8

In [31]: bits = np.array([list(np.binary_repr(b, width=nbits)) for b in bs], dtype=np.uint8)

In [32]: bits
Out[32]: 
array([[1, 1, 1, 1, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 0],
       [0, 0, 0, 0, 1, 1, 0, 0]], dtype=uint8)

bits is the array containing the bits of the binary representation of each value. The ratio that you want is the mean of the columns:

In [33]: bits.mean(axis=0)
Out[33]: array([ 0.8,  0.8,  0.6,  0.6,  0.6,  0.6,  0.2,  0. ])

The order of those values is from the highest order bit to the lowest. It might be more natural for the index of the result to match the usual bit indices. For that, just reverse the result:

In [34]: bits.mean(axis=0)[::-1]
Out[34]: array([ 0. ,  0.2,  0.6,  0.6,  0.6,  0.6,  0.8,  0.8])
Sign up to request clarification or add additional context in comments.

1 Comment

I think the result must be reversed to correctly match the bit indices
0

The hardest step here is taking an object array of longs, and storing them in a numpy container suitable for vectorization. The following function breaks up a long (array) into little-endian words:

def long_to_multi_word(l, dtype=np.uint64, nwords=None):
    dtype = np.dtype(dtype)
    l = np.asarray(l, object)

    nbits = 8 * dtype.itemsize

    if nwords is None:
        lmax = l.max()
        nwords = 0
        while lmax != 0:
            lmax >>= nbits
            nwords += 1

    arr = np.zeros(l.shape + (nwords,), dtype)

    mask = (1 << nbits) - 1

    for i in range(0, nwords):
        arr[...,i] = l & mask
        l = l >> nbits

    return arr

Giving:

>>> data = [1, 2, 3, 2**128 + 2**64 + 42]   # one of these is too big to fit in a uint64
>>> data_words = long_to_multi_word(data)
>>> data_words
array([[ 1,  0,  0],
       [ 2,  0,  0],
       [ 3,  0,  0],
       [42,  1,  1]], dtype=uint64)

Now the approach is just to use np.unpackbits:

# could have used long_to_multi_word(data, np.uint8), but would be slower
data_bytes = data_words.view(np.uint8)

data_bits = np.unpackbits(data_bytes, axis=-1)
n_bits = data_bits.sum(axis=0)

Which can be trivially converted into an average

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.