
I have a 2D UINT8 NumPy array of shape (149797, 64). Each element is either 0 or 1. I want to pack the binary values in each row into a single UINT64 value, so that the result is a UINT64 array of shape (149797,). I tried the following code using NumPy's packbits function.

test = np.random.randint(0, 2, (149797, 64), dtype=np.uint8)
col_pack = np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
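
For context, np.packbits packs bits MSB-first within each byte; the reshape to (-1, 8, 8) splits each 64-bit row into 8 bytes, and the [:, ::-1] reverses the byte order so that viewing the result as uint64 comes out right on a little-endian machine. A tiny sanity check (the example row is made up for illustration):

row = np.zeros((1, 64), dtype=np.uint8)
row[0, -1] = 1  # set only the least significant bit of the row
packed = np.packbits(row.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
assert packed[0] == 1  # holds on a little-endian machine (the common case)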

The packbits call takes about 10 ms to execute. A simple reshape of this array by itself already seems to take around 7 ms. I also tried iterating over the 2D array and using shift operations to achieve the same result, but there was no speed improvement.

Finally, I also want to compile it with Numba for CPU.

import numpy as np
from numba import njit, prange

@njit
def shifting(bitlist):
    x = np.zeros(149797, dtype=np.uint64)
    rows, cols = bitlist.shape
    for i in range(rows):
        out = 0
        for bit in range(cols):
            out = (out << 1) | bitlist[i][bit]  # if I comment out the bitlist access, time = 190 microseconds
        x[i] = np.uint64(out)  # time drops to microseconds if this line is commented out in the njit version
    return x

It takes about 6 ms using njit.

Here is the parallel njit version:

@njit(parallel=True)
def shifting(bitlist):
    rows, cols = 149797, 64
    out = 0
    z = np.zeros(rows, dtype=np.uint64)
    for i in prange(rows):
        for bit in range(cols):
            z[i] = (z[i] * 2) + bitlist[i, bit]  # time becomes ~100 microseconds if I use 'out' instead of the z[i] array

    return z

It's slightly better, with a 3.24 ms execution time (Google Colab, dual-core 2.2 GHz). Currently, the Python solution using byteswap (Paul's method) seems to be the best one, at 1.74 ms.

How can we speed up this conversion further? Is there scope for using vectorization (or parallelization), bitarrays, etc. to achieve a speedup?

Ref: numpy packbits pack to uint16 array

On a 12-core machine (Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz):

Paul's method: 1595.0 microseconds (it does not use multiple cores, I suppose)

Numba code: 146.0 microseconds (the aforementioned parallel Numba version)

i.e. around a 10x speedup!

3 Comments
  • You can shave off about a third of the runtime by not using chained indexing for bitlist: replace bitlist[i][bit] with bitlist[i, bit]. You are creating a non-trivially sized intermediate array at each iteration of the loop, and when you do that rows * cols times, it adds up! (See the sketch after these comments.) Commented Feb 7, 2020 at 17:32
  • You can try parallelization (@njit(parallel=True), and nb.prange instead of range in the outer loop). The initialization with zeros is also not necessary (use np.empty instead). Do you also have many arrays to convert? Commented Feb 7, 2020 at 20:21
  • @user3483203 It improves the speed from 6.2 s to 5.4 s in normal Python, but there is no considerable change in the Numba-jitted version. Also, the packbits version already gives 10 ms. The pure Python loop takes seconds, packbits 10 ms, and Numba (so far) around 6 ms, and I am looking for < 1 ms. Commented Feb 8, 2020 at 6:11
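
As a minimal illustration of the first comment's point (the array here is made up), chained indexing first materializes an intermediate 1-D row view and then indexes into it, while tuple indexing reads the element directly:

import numpy as np

a = np.zeros((4, 4), dtype=np.uint8)

v1 = a[1][2]   # a[1] creates an intermediate row view (a new ndarray object) first
v2 = a[1, 2]   # single element access, no intermediate object
assert v1 == v2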

2 Answers


You can get a sizeable speedup by using byteswap instead of reshaping etc.:

test = np.random.randint(0, 2, (149797, 64),dtype=np.uint8)

np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
# array([ 1079982015491401631,   246233595099746297, 16216705265283876830,
#        ...,  1943876987915462704, 14189483758685514703,
#        12753669247696755125], dtype=uint64)
np.packbits(test).view(np.uint64).byteswap()
# array([ 1079982015491401631,   246233595099746297, 16216705265283876830,
#        ...,  1943876987915462704, 14189483758685514703,
#        12753669247696755125], dtype=uint64)

timeit(lambda:np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64),number=100)
# 1.1054180909413844

timeit(lambda:np.packbits(test).view(np.uint64).byteswap(),number=100)
# 0.18370431219227612
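
A quick way to confirm the two routes agree (reusing the test array from above). The byteswap is needed because view(np.uint64) reads the 8 packed bytes in the machine's native (typically little-endian) order, while packbits writes the most significant byte of each row first:

a = np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
b = np.packbits(test).view(np.uint64).byteswap()
assert np.array_equal(a, b)  # both give the same uint64 per row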

1 Comment

It gave 1.74 ms, the best so far. I actually wanted to use the Python version with Numba JIT to exploit vectorization, parallelization, etc., but unfortunately the packbits function is not fully supported by Numba. Thanks.

A slightly faster Numba solution (version 0.46/Windows).

Code

import numpy as np
import numba as nb

# with memory allocation
@nb.njit(parallel=True)
def shifting(bitlist):
    assert bitlist.shape[1] == 64
    x = np.empty(bitlist.shape[0], dtype=np.uint64)

    for i in nb.prange(bitlist.shape[0]):
        out = np.uint64(0)
        for bit in range(bitlist.shape[1]):
            out = (out << 1) | bitlist[i, bit]
        x[i] = out
    return x

# without memory allocation
@nb.njit(parallel=True)
def shifting_2(bitlist, x):
    assert bitlist.shape[1] == 64

    for i in nb.prange(bitlist.shape[0]):
        out = np.uint64(0)
        for bit in range(bitlist.shape[1]):
            out = (out << 1) | bitlist[i, bit]
        x[i] = out
    return x

Timings

test = np.random.randint(0, 2, (149797, 64), dtype=np.uint8)

# If you call this function multiple times, allocating the output
# buffer only once may be enough
x = np.empty(test.shape[0], dtype=np.uint64)

# Warmup: the first call takes significantly longer (JIT compilation)
res = shifting(test)
res = shifting_2(test, x)

%timeit res=shifting(test)
#976 µs ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit res=shifting_2(test,x)
#764 µs ± 63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.packbits(test).view(np.uint64).byteswap()
#8.07 ms ± 52.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
#17.9 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

5 Comments

I think it depends on your hardware (number of cores, memory type, etc.) and library versions. I tried your exact code on Google Colab with a dual-core CPU (2.3 GHz, 2 threads per core) and the results were as follows: 4.06 ms for shifting, 4 ms for shifting_2, 2.22 ms for byteswap-pack, and 12.8 ms for reshape-pack. Packbits-byteswap seems to be the fastest, and memory allocation with empty (instead of zeros) does not seem to make any difference here. What do you think? Is it because of hardware and library version differences?
@anilsathyan7 Are you using a special build of NumPy, or is it a standard Linux build? There can be significant differences depending on how NumPy is compiled (compiler and compiler flags). On Windows (Anaconda), timings can differ quite significantly on the same hardware. I measured quite large differences on other examples (the standard Linux build was always faster).
i.e. 100 loops, best of 3: 4.1 ms per loop; 100 loops, best of 3: 4 ms per loop; 100 loops, best of 3: 2.06 ms per loop; 100 loops, best of 3: 12.7 ms per loop, in order ...
As I said, I use the 'default' NumPy, Numba, etc. in Colab; they should be highly optimized.
I got the 10x speed-up on a high-end machine, using the parallel Numba method compared to the Python packbits method! I wonder whether there is a parallel way of doing the same thing (packbits) in pure Python, using all cores? That would be a fair comparison ... (a rough sketch of one option follows below)
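
For what it's worth, here is a minimal, hedged sketch of one way to spread packbits across cores from pure Python, by splitting the rows into chunks and packing each chunk in a thread pool. Whether this helps at all depends on whether np.packbits releases the GIL in your NumPy build; the function name and chunk count are assumptions for illustration:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def packbits_parallel(test, n_chunks=4):
    # Split the rows into roughly equal chunks and pack each chunk in its
    # own thread; this only beats a single np.packbits call if packbits
    # releases the GIL while it runs.
    chunks = np.array_split(test, n_chunks, axis=0)
    with ThreadPoolExecutor(max_workers=n_chunks) as ex:
        packed = list(ex.map(lambda c: np.packbits(c).view(np.uint64).byteswap(), chunks))
    return np.concatenate(packed)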
