
I'm trying to improve the time taken to add two fixed-length arrays. I must convert two byte strings into two fixed-length arrays of shorts, add the two arrays together, and finally output the resultant array as a byte string.

Currently I have:

import cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def cython_layer( char* c_string1, char* c_string2, int length ):
    cdef np.ndarray[ np.int16_t, ndim=1 ] np_orig = np.fromstring( c_string1[:length], np.int16, count=length//2 )
    cdef np.ndarray[ np.int16_t, ndim=1 ] np_new  = np.fromstring( c_string2[:length], np.int16, count=length//2 )
    res = np_orig + np_new
    return res.tostring() 

However, the simpler NumPy-only method yields similar (in fact better) performance:

def layer(orig, new, length):
    np_orig = np.fromstring(orig, np.int16, count=length // 2)
    np_new  = np.fromstring(new,  np.int16, count=length // 2)
    res     = np_orig + np_new
    return res.tostring()

Is it possible to improve on NumPy's speed for this simple example? My gut says yes, but I don't have enough of a handle on Cython to improve any further. Using IPython's %timeit magic I've clocked the functions at:

100000 loops, best of 3: 5.79 µs per loop    # python + numpy
100000 loops, best of 3: 8.77 µs per loop    # cython + numpy

e.g:

a = np.array( range(1024), dtype=np.int16).tostring()
layer(a,a,len(a)) == cython_layer(a,a,len(a))
# True
%timeit layer(a, a, len(a) )
# 100000 loops, best of 3: 6.06 µs per loop
%timeit cython_layer(a, a, len(a))
# 100000 loops, best of 3: 9.19 µs per loop
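Outside IPython, the same comparison can be reproduced with the stdlib timeit module. A sketch of the pure-Python side only; np.frombuffer stands in for the now-deprecated np.fromstring, and absolute timings will vary by machine:

```python
import timeit
import numpy as np

def layer(orig, new, length):
    # pure-NumPy version from the question, with np.frombuffer
    # standing in for the deprecated np.fromstring
    np_orig = np.frombuffer(orig, dtype=np.int16, count=length // 2)
    np_new = np.frombuffer(new, dtype=np.int16, count=length // 2)
    return (np_orig + np_new).tobytes()

a = np.arange(1024, dtype=np.int16).tobytes()
n = 100000
t = timeit.timeit(lambda: layer(a, a, len(a)), number=n)
print(f"{t / n * 1e6:.2f} µs per loop")
```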

edit: changed layer to use count=len(orig)//2. orig and new are both byte strings of length 2048; converting them to shorts (np.int16) results in an output array of size 1024.
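That size relationship can be double-checked with np.frombuffer, which views the bytes without copying:

```python
import numpy as np

buf = bytes(2048)                        # 2048 raw bytes, as in the question
arr = np.frombuffer(buf, dtype=np.int16)
print(arr.size)                          # 1024 shorts (2 bytes each)
```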

edit2: I'm an idiot.

edit3: example in action

  • So how are you calling this function? Commented Jun 18, 2017 at 18:40
  • What's chunk_size? Your code as posted doesn't work... I think one issue is that your char* arguments are probably autoconverted from str when your function is called, and then autoconverted back to str (i.e. unnecessarily copied) before being passed to np.fromstring. Commented Jun 18, 2017 at 18:59
  • @DavidW sorry, had to_string() instead of tostring(). I've also updated the python + numpy solution to implicitly use the length of the byte array. Are you suggesting that np.fromstring(char) would work? Because it converts only the first 48 bytes to shorts. Commented Jun 19, 2017 at 20:03
  • No - I'm just suggesting that it gets converted char*->str->np.array and so it ends up being copied twice. I don't know if that's easily avoidable though. Commented Jun 19, 2017 at 20:18
  • Is there any chance you can add a full working example, including whatever benchmark you're using? Commented Jun 19, 2017 at 21:17

1 Answer


One solution is to skip the numpy arrays and just use C pointers:

from cpython.bytes cimport PyBytes_FromStringAndSize
from libc.stdint cimport int16_t

def layer2(char* orig, char* new, length):
    cdef:
        bytes res = PyBytes_FromStringAndSize(NULL,2*(length//2))
        char* res_as_charp = res
        int16_t* orig_as_int16p = <int16_t*>orig
        int16_t* new_as_int16p = <int16_t*>new
        int16_t* res_as_int16p = <int16_t*>res_as_charp       
        Py_ssize_t i


    for i in range(length//2):
        res_as_int16p[i] = orig_as_int16p[i] + new_as_int16p[i]

    return res

Essentially, I create an empty string for the result using the C API function PyBytes_FromStringAndSize and modify that. The advantage of that is that, unlike your version, both the inputs and the output are used as-is and not copied. Note that the only situation where you're allowed to modify a Python string like this is when you've just created a new one using PyBytes_FromStringAndSize(NULL, length) - this is in the C API documentation.

I then get a char* to it (doesn't copy the data, just points to existing data).

I then cast the char* for both inputs and the output to be int16_t* - this just changes how the memory is interpreted.

I then loop over the array doing the addition and using pointer indexing.
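The same "write into a preallocated output, copy nothing" idea can be sketched in pure NumPy as well, since np.frombuffer over a mutable bytearray yields a writable view (a sketch, not the answer's Cython code):

```python
import numpy as np

orig = np.arange(1024, dtype=np.int16).tobytes()
new = orig

res = bytearray(len(orig))  # preallocated, writable output buffer
np.add(np.frombuffer(orig, dtype=np.int16),      # zero-copy view of input 1
       np.frombuffer(new, dtype=np.int16),       # zero-copy view of input 2
       out=np.frombuffer(res, dtype=np.int16))   # writable view of the output
result_bytes = bytes(res)
```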

In terms of speed, this is about 8 times faster than the Python implementation for short strings (length < 100). This is largely due to the fixed Python overhead of function calls and creating numpy arrays, I believe. For longer strings (length >= 100000) my version is actually slightly slower. I suspect numpy has a better vectorized/parallelized loop for the addition.


Extra notes

Code shown is for Python 3 - for Python 2 you want PyString_... instead of PyBytes_...

You can get a slight improvement (~10-20%) on your pure Python version by using np.frombuffer instead of np.fromstring. This avoids copying the inputs.
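A quick illustration of the difference: np.frombuffer returns a read-only, zero-copy view over the immutable bytes, whereas np.fromstring always copies.

```python
import numpy as np

data = np.arange(4, dtype=np.int16).tobytes()

a = np.frombuffer(data, dtype=np.int16)  # zero-copy view of the bytes
print(a.tolist())          # [0, 1, 2, 3]
print(a.flags.writeable)   # False: it's a view of immutable bytes
```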


1 Comment

Thanks @DavidW - that works an absolute charm. Great explanation too. For anyone interested, I saw a substantial performance increase:
%timeit layer(byts, byts, len(byts))         # 100000 loops, best of 3: 5.83 µs per loop
%timeit cython_layer(byts, byts, len(byts))  # 1000000 loops, best of 3: 438 ns per loop
