I am just getting started with Cython and would appreciate some pointers as to how to approach this process. I have identified a speed bottleneck in my code and would like to optimize the performance of a specific operation.
I have a pandas DataFrame trades that looks like this:
                           Codes    Price  Size
Time
2015-02-24 15:30:01-05:00  R6,IS  11.6100   100
2015-02-24 15:30:01-05:00  R6,IS  11.6100   100
2015-02-24 15:30:01-05:00  R6,IS  11.6100   100
2015-02-24 15:30:01-05:00         11.6100   375
2015-02-24 15:30:01-05:00  R6,IS  11.6100   100
...                          ...      ...   ...
2015-02-24 15:59:55-05:00  R6,IS  11.5850   100
2015-02-24 15:59:55-05:00  R6,IS  11.5800   200
2015-02-24 15:59:55-05:00      T  11.5850   100
2015-02-24 15:59:56-05:00  R6,IS  11.5800   175
2015-02-24 15:59:56-05:00  R6,IS  11.5800   225

[5187 rows x 3 columns]
I have a numpy array called codes:
array(['4', 'AP', 'CM', 'BP', 'FA', 'FI', 'NC', 'ND', 'NI', 'NO', 'PT',
       'PV', 'PX', 'SD', 'WO'],
      dtype='|S2')
I need to filter trades so that any row whose trades['Codes'] entry (a comma-separated string of codes) contains one of the values in codes is excluded. Currently I am doing this:
ix = trades.Codes.str.split(',').apply(lambda cs: not any(c in codes for c in cs))
trades = trades[ix]
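To make the current behaviour reproducible, here is a minimal self-contained sketch of that filter. The five-row frame is made-up sample data (the real frame has 5187 rows and a Time index), and the name filtered is only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real trades frame.
trades = pd.DataFrame(
    {
        "Codes": ["R6,IS", "", "T", "R6,AP", "4"],
        "Price": [11.61, 11.61, 11.585, 11.58, 11.58],
        "Size": [100, 375, 100, 175, 225],
    }
)

# Codes that should cause a row to be dropped.
codes = np.array(["4", "AP", "CM", "BP", "FA", "FI", "NC", "ND", "NI",
                  "NO", "PT", "PV", "PX", "SD", "WO"])

# Split each Codes string on commas and keep the row only if none of
# its codes appears in `codes`.
ix = trades.Codes.str.split(",").apply(
    lambda cs: not any(c in codes for c in cs)
)
filtered = trades[ix]
print(filtered)
```

With this sample, the rows containing AP and 4 are dropped and the other three are kept.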
However, this is too slow and I need to make it faster. I want to use Cython (as described here) or maybe numba; it seems like Cython is the better option.
I basically need a function like this:
def isinCodes(codes_array1, codes_array2):
    for x in codes_array1:
        for y in codes_array2:
            if x == y: return True
    return False
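As a baseline to check any compiled version against, here is a plain-Python sketch of the same logic applied row by row. The names isin_codes and isin_codes_set and the sample lists are hypothetical; the set-based variant is not Cython, just an equivalent formulation that might serve as a comparison point:

```python
def isin_codes(row_codes, excluded):
    """Return True if any element of row_codes also appears in excluded."""
    for x in row_codes:
        for y in excluded:
            if x == y:
                return True
    return False


def isin_codes_set(row_codes, excluded_set):
    """Same result as isin_codes, using a set intersection test instead."""
    return not excluded_set.isdisjoint(row_codes)


# Hypothetical sample inputs, not the real data.
excluded = ["4", "AP", "CM"]
rows = ["R6,IS", "", "T", "R6,AP", "4"]

# True marks a row that should be excluded from trades.
flags = [isin_codes(r.split(","), excluded) for r in rows]
flags_set = [isin_codes_set(r.split(","), set(excluded)) for r in rows]
print(flags)
```

Both lists should agree, so either one can be used to verify that a typed Cython function returns the same flags.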
What types do I need to use when cythonizing?