I'm trying to process some data in pandas that looks like this in the CSV (it's much bigger):

2014.01.02,09:00,1.37562,1.37562,1.37545,1.37545,21
2014.01.02,09:01,1.37545,1.37550,1.37542,1.37546,18
2014.01.02,09:02,1.37546,1.37550,1.37546,1.37546,15
2014.01.02,09:03,1.37546,1.37563,1.37546,1.37559,39
2014.01.02,09:04,1.37559,1.37562,1.37555,1.37561,37
2014.01.02,09:05,1.37561,1.37564,1.37558,1.37561,35
2014.01.02,09:06,1.37561,1.37566,1.37558,1.37563,38
2014.01.02,09:07,1.37563,1.37567,1.37561,1.37566,42
2014.01.02,09:08,1.37570,1.37571,1.37564,1.37566,25

I imported it using:

raw_data = pd.read_csv('raw_data.csv', engine='c', header=None, index_col=0, names=['date', 'time', 'open', 'high', 'low', 'close', 'volume'], parse_dates=[[0,1]])
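
(On newer pandas versions the nested parse_dates=[[0,1]] form for combining columns is no longer supported; a minimal sketch of a rough equivalent, assuming the same file layout, is to parse the two columns after reading:)

import pandas as pd

# read date and time as plain text columns, then combine them into the index
raw_data = pd.read_csv(
    'raw_data.csv', header=None,
    names=['date', 'time', 'open', 'high', 'low', 'close', 'volume'])
raw_data.index = pd.to_datetime(
    raw_data.pop('date') + ' ' + raw_data.pop('time'),
    format='%Y.%m.%d %H:%M')
raw_data.index.name = 'date_time'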

And got this (data):

                        open     high      low    close  volume
date_time                                                      
2014-01-02 09:00:00  1.37562  1.37562  1.37545  1.37545      21
2014-01-02 09:01:00  1.37545  1.37550  1.37542  1.37546      18
2014-01-02 09:02:00  1.37546  1.37550  1.37546  1.37546      15
2014-01-02 09:03:00  1.37546  1.37563  1.37546  1.37559      39
2014-01-02 09:04:00  1.37559  1.37562  1.37555  1.37561      37
2014-01-02 09:05:00  1.37561  1.37564  1.37558  1.37561      35
2014-01-02 09:06:00  1.37561  1.37566  1.37558  1.37563      38
2014-01-02 09:07:00  1.37563  1.37567  1.37561  1.37566      42
2014-01-02 09:08:00  1.37570  1.37571  1.37564  1.37566      25
2014-01-02 09:09:00  1.37566  1.37566  1.37555  1.37560      27
2014-01-02 09:10:00  1.37558  1.37559  1.37527  1.37527      44
2014-01-02 09:11:00  1.37527  1.37537  1.37527  1.37533      28
2014-01-02 09:12:00  1.37532  1.37534  1.37528  1.37528      22
2014-01-02 09:13:00  1.37534  1.37537  1.37521  1.37532      26
2014-01-02 09:14:00  1.37532  1.37536  1.37528  1.37534      16
2014-01-02 09:15:00  1.37534  1.37534  1.37526  1.37532      20
2014-01-02 09:16:00  1.37532  1.37533  1.37526  1.37529      23
2014-01-02 09:17:00  1.37529  1.37536  1.37529  1.37530      19
2014-01-02 09:18:00  1.37530  1.37530  1.37527  1.37527      19
2014-01-02 09:19:00  1.37527  1.37530  1.37527  1.37527      16
2014-01-02 09:20:00  1.37528  1.37542  1.37527  1.37541      22
2014-01-02 09:21:00  1.37542  1.37542  1.37536  1.37536      16
2014-01-02 09:22:00  1.37536  1.37559  1.37536  1.37559      32

Now I want to construct a y array: for each position I take a window of X_period=10 rows from my data and put its values into X; then, depending on how the close 5 bars after the end of the window compares with the open of the window's last row, I fill y:

X_period = 10
period = X_period + 5
columns = data.shape[1]
X = np.zeros((len(data)-period, columns*X_period), dtype=np.float64)
y = np.zeros(len(data)-period, dtype=np.int64)
for i in range(len(data)-period):
    # flatten the X_period-row window (all columns) into one feature vector
    input_data = data.iloc[i:i+X_period]
    X[i] = np.array(input_data, dtype=np.float64).ravel()
    # label: compare the close 5 bars after the window with the window's last open
    if float(data['close'].iloc[i+period-1]) > float(data['open'].iloc[i+X_period-1]):
        y[i] = 1
    elif float(data['close'].iloc[i+period-1]) < float(data['open'].iloc[i+X_period-1]):
        y[i] = 2

Now, this does the job but it's very slow. Any idea how to speed it up?

2 Answers

One way to speed up this process is to use Cython. To compare the performance, I tested the Python function and the Cython function on a sample dataset of intraday one-minute OHLC+SIZE bars for a single day. The dataset looks like this:

print(data)

                      open   high    low  close   SIZE
DATE_TIME                                             
2011-01-03 09:30:00  41.56  41.56  41.43  41.46   4025
2011-01-03 09:31:00  41.50  41.74  41.49  41.74   4377
2011-01-03 09:32:00  41.75  41.75  41.70  41.70   2700
2011-01-03 09:33:00  41.72  41.73  41.72  41.72   3000
2011-01-03 09:34:00  41.73  41.75  41.71  41.75   1000
2011-01-03 09:35:00  41.75  41.82  41.75  41.80   7900
2011-01-03 09:36:00  41.81  41.81  41.75  41.77   3550
2011-01-03 09:37:00  41.77  41.81  41.76  41.81   3008
...                    ...    ...    ...    ...    ...
2011-01-03 15:53:00  41.95  41.96  41.93  41.95   7675
2011-01-03 15:54:00  41.94  41.95  41.92  41.94   9469
2011-01-03 15:55:00  41.94  41.94  41.89  41.89   9700
2011-01-03 15:56:00  41.89  41.89  41.88  41.88  10000
2011-01-03 15:57:00  41.88  41.89  41.86  41.86  20978
2011-01-03 15:58:00  41.86  41.86  41.84  41.86  22770
2011-01-03 15:59:00  41.85  41.86  41.83  41.85  25276
2011-01-03 16:00:00  41.85  41.85  41.85  41.85    100

Python performance:

import py_func
%timeit -n3 -r10 X_py, y_py = py_func.py_func(data)

3 loops, best of 10: 153 ms per loop

Cython performance:

import cy_func
%timeit -n3 -r10 X_cy, y_cy = cy_func.cy_func(data.values)

3 loops, best of 10: 1.97 ms per loop

So we see an almost two-orders-of-magnitude speedup from Cython. To check that the Python and Cython functions return equal results:

from numpy.testing import assert_array_almost_equal

assert_array_almost_equal(X_py, X_cy)
assert_array_almost_equal(y_py, y_cy)

Here is the code.

Your original mixed Python/NumPy code (py_func.py), used as the benchmark:

# filename: py_func.py

import numpy as np
import pandas as pd


def py_func(data):

    X_period = 10
    period = X_period + 5
    columns = data.shape[1]
    X = np.zeros((len(data)-period, columns*X_period), dtype=np.float64)
    y = np.zeros(len(data)-period, dtype=np.int64)

    for i in range(len(data)-period):

        # flatten the X_period-row window into one feature vector
        input_data = data.iloc[i:i+X_period]
        X[i] = np.array(input_data, dtype=np.float64).ravel()
        # label: compare the close 5 bars after the window with the window's last open
        if float(data['close'].iloc[i+period-1]) > float(data['open'].iloc[i+X_period-1]):
            y[i] = 1
        elif float(data['close'].iloc[i+period-1]) < float(data['open'].iloc[i+X_period-1]):
            y[i] = 2

    return X, y

By adding static typing to the variables and using NumPy array buffers, we can turn the original Python code into the Cython code below (cy_func.pyx):

# filename: cy_func.pyx

import numpy as np
cimport numpy as np
cimport cython


@cython.boundscheck(False)
@cython.wraparound(False)
def cy_func(np.ndarray[double, ndim=2] data):
    cdef int X_period = 10
    cdef int period = X_period + 5
    cdef int rows = data.shape[0]
    cdef int columns = data.shape[1]

    cdef np.ndarray[np.float64_t, ndim=2] X = np.zeros((rows-period, columns*X_period), dtype=np.float64)
    cdef np.ndarray[np.int64_t, ndim=1] y = np.zeros(rows-period, dtype=np.int64)

    cdef unsigned int i, N
    N = rows - period

    cdef int OPEN = 0
    cdef int CLOSE = 3
    cdef np.ndarray[double, ndim=2] input_data

    for i in range(N):

        input_data = data[i:i+X_period]
        X[i,:] = input_data.reshape(columns*X_period,)

        if data[i+period-1,CLOSE] > data[i+X_period-1,OPEN]:
            y[i] = 1
        elif data[i+period-1,CLOSE] < data[i+X_period-1,OPEN]:
            y[i] = 2

    return X, y

To compile the Cython source file into a Python extension module, you can write the following setup.py:

# filename: setup.py
from setuptools import Extension, setup
from Cython.Build import cythonize
from numpy import get_include


# include NumPy's headers so that "cimport numpy" compiles
ext = Extension(name='cy_func',
                sources=['cy_func.pyx'],
                include_dirs=[get_include()])

setup(name='cy_func',
      ext_modules=cythonize(ext))

To build it, navigate to the directory containing your cy_func.pyx and setup.py, then run on the command line:

python setup.py build_ext --inplace

To use the compiled extension module:

# navigate to the folder containing the compiled module, or add it with sys.path.append()
%cd /home/Jian/Dropbox/Coding/Python/Cython/ex_stackoverflow1

# import cython module
import cy_func
# to use the module function, cy_func.cy_func(...)
%prun -s tottime -l 5 X_cy, y_cy = cy_func.cy_func(data.values)

1 Comment

Sad that you went to all that effort only to be ignored; pretty disrespectful, diabolical even.

You should use a structured NumPy array rather than pandas for indexing-intensive work. You also don't need the high, low, time, or date columns in your array slices, so you can drop those too. You should end up working on an array like the following:

# build a structured array holding just the open and close columns
np_arr = np.array(list(zip(data['open'], data['close'])),
                  dtype=[('open', 'float'), ('close', 'float')])
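
Field access on the structured array is then by name; a tiny self-contained sketch (toy values, only to show the indexing):

import numpy as np

# two hypothetical bars, just to illustrate named-field indexing
toy = np.array([(1.37562, 1.37545), (1.37545, 1.37546)],
               dtype=[('open', 'float'), ('close', 'float')])
print(toy['close'][1] > toy['open'][0])   # plain ndarray field lookups, no pandas overhead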

And as Jianxun Li has mentioned, you can always take things into Cython if you become obsessed.
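
If you want to drop the Python-level loop entirely, the same X and y can also be built with plain vectorized NumPy. This is only a sketch, not the original poster's code: it assumes the data frame, X_period and labelling rule from the question, and sliding_window_view needs NumPy 1.20 or newer.

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# all columns as a float array, column order as in the question:
# open, high, low, close, volume
values = data.to_numpy(dtype=np.float64)
OPEN, CLOSE = 0, 3

X_period = 10
period = X_period + 5
N = len(values) - period

# X: each row is one flattened window of X_period consecutive bars
windows = sliding_window_view(values, (X_period, values.shape[1]))[:N, 0]
X_vec = windows.reshape(N, -1)

# y: compare the close 5 bars after the window with the open of the window's last bar
close_ahead = values[period - 1:period - 1 + N, CLOSE]
open_ref = values[X_period - 1:X_period - 1 + N, OPEN]
y_vec = np.zeros(N, dtype=np.int64)
y_vec[close_ahead > open_ref] = 1
y_vec[close_ahead < open_ref] = 2

The result should match the loop version; you can verify it with assert_array_almost_equal as in the other answer.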

