I'm trying to process some data in pandas that looks like this in the CSV (it's much bigger):
2014.01.02,09:00,1.37562,1.37562,1.37545,1.37545,21
2014.01.02,09:01,1.37545,1.37550,1.37542,1.37546,18
2014.01.02,09:02,1.37546,1.37550,1.37546,1.37546,15
2014.01.02,09:03,1.37546,1.37563,1.37546,1.37559,39
2014.01.02,09:04,1.37559,1.37562,1.37555,1.37561,37
2014.01.02,09:05,1.37561,1.37564,1.37558,1.37561,35
2014.01.02,09:06,1.37561,1.37566,1.37558,1.37563,38
2014.01.02,09:07,1.37563,1.37567,1.37561,1.37566,42
2014.01.02,09:08,1.37570,1.37571,1.37564,1.37566,25
I imported it using:
raw_data = pd.read_csv('raw_data.csv', engine='c', header=None, index_col=0, names=['date', 'time', 'open', 'high', 'low', 'close', 'volume'], parse_dates=[[0,1]])
And got this (data):
open high low close volume
date_time
2014-01-02 09:00:00 1.37562 1.37562 1.37545 1.37545 21
2014-01-02 09:01:00 1.37545 1.37550 1.37542 1.37546 18
2014-01-02 09:02:00 1.37546 1.37550 1.37546 1.37546 15
2014-01-02 09:03:00 1.37546 1.37563 1.37546 1.37559 39
2014-01-02 09:04:00 1.37559 1.37562 1.37555 1.37561 37
2014-01-02 09:05:00 1.37561 1.37564 1.37558 1.37561 35
2014-01-02 09:06:00 1.37561 1.37566 1.37558 1.37563 38
2014-01-02 09:07:00 1.37563 1.37567 1.37561 1.37566 42
2014-01-02 09:08:00 1.37570 1.37571 1.37564 1.37566 25
2014-01-02 09:09:00 1.37566 1.37566 1.37555 1.37560 27
2014-01-02 09:10:00 1.37558 1.37559 1.37527 1.37527 44
2014-01-02 09:11:00 1.37527 1.37537 1.37527 1.37533 28
2014-01-02 09:12:00 1.37532 1.37534 1.37528 1.37528 22
2014-01-02 09:13:00 1.37534 1.37537 1.37521 1.37532 26
2014-01-02 09:14:00 1.37532 1.37536 1.37528 1.37534 16
2014-01-02 09:15:00 1.37534 1.37534 1.37526 1.37532 20
2014-01-02 09:16:00 1.37532 1.37533 1.37526 1.37529 23
2014-01-02 09:17:00 1.37529 1.37536 1.37529 1.37530 19
2014-01-02 09:18:00 1.37530 1.37530 1.37527 1.37527 19
2014-01-02 09:19:00 1.37527 1.37530 1.37527 1.37527 16
2014-01-02 09:20:00 1.37528 1.37542 1.37527 1.37541 22
2014-01-02 09:21:00 1.37542 1.37542 1.37536 1.37536 16
2014-01-02 09:22:00 1.37536 1.37559 1.37536 1.37559 32
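A side note on the import step above: the nested-list form parse_dates=[[0,1]] is deprecated in recent pandas versions, so the call may warn or fail there. A minimal sketch of an equivalent import, assuming the date and time columns are read as plain strings; the inline sample and the names sample, raw and data are only for illustration:

```python
import io

import pandas as pd

# Two rows copied from the CSV above, used as an in-memory stand-in for raw_data.csv.
sample = io.StringIO(
    "2014.01.02,09:00,1.37562,1.37562,1.37545,1.37545,21\n"
    "2014.01.02,09:01,1.37545,1.37550,1.37542,1.37546,18\n"
)
names = ['date', 'time', 'open', 'high', 'low', 'close', 'volume']

# Read date and time as ordinary string columns first.
raw = pd.read_csv(sample, header=None, names=names)

# Combine them into one datetime index with an explicit format.
raw.index = pd.to_datetime(raw['date'] + ' ' + raw['time'], format='%Y.%m.%d %H:%M')
raw.index.name = 'date_time'

# Keep only the numeric columns, as in the output shown above.
data = raw.drop(columns=['date', 'time'])
```

Passing an explicit format also avoids per-row format guessing, which can matter on a large file.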
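Now I want to build a feature matrix X and a label array y. For each starting row i, I take a window of X_period=10 rows and flatten it into one row of X. Then I compare the close at the end of period=X_period+5 rows (row i+period-1) with the open of the last row in the window (row i+X_period-1): y is 1 if the close is higher, 2 if it is lower, and 0 otherwise: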
import numpy as np

X_period = 10              # rows per input window
period = X_period + 5      # window plus 5 look-ahead rows
columns = data.shape[1]
X = np.zeros((len(data) - period, columns * X_period), dtype=float)
y = np.zeros(len(data) - period, dtype=int)
for i in range(len(data) - period):
    # flatten the X_period-row window into one row of X
    input_data = data.iloc[i:i + X_period, 0:columns]
    X[i] = np.array(input_data, dtype=float).ravel()
    # label: close at the end of the look-ahead vs. open of the last window row
    if float(data['close'].iloc[i + period - 1]) > float(data['open'].iloc[i + X_period - 1]):
        y[i] = 1
    elif float(data['close'].iloc[i + period - 1]) < float(data['open'].iloc[i + X_period - 1]):
        y[i] = 2
Now, this does the job, but it's very slow. Any idea how to speed this up?
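One possible direction, offered as a rough sketch and not a definitive answer: the per-row loop can be replaced with NumPy index arithmetic, so that all windows and all labels are computed in a few array operations. The function name build_xy and its lookahead parameter are hypothetical, introduced only for illustration. The sketch assumes every column of data is numeric and that 'open' and 'close' columns exist:

```python
import numpy as np
import pandas as pd


def build_xy(data: pd.DataFrame, X_period: int = 10, lookahead: int = 5):
    """Vectorized sketch of the windowing loop (hypothetical helper)."""
    period = X_period + lookahead
    n = max(len(data) - period, 0)          # number of samples, as in the loop
    values = data.to_numpy(dtype=float)     # shape (rows, columns)

    # idx[i] holds the row numbers i, i+1, ..., i+X_period-1 of window i.
    idx = np.arange(n)[:, None] + np.arange(X_period)[None, :]
    # values[idx] has shape (n, X_period, columns); flatten each window row-major.
    X = values[idx].reshape(n, X_period * values.shape[1])

    # Close at row i+period-1 and open at row i+X_period-1, for every i at once.
    close = data['close'].to_numpy(dtype=float)[period - 1:period - 1 + n]
    open_ = data['open'].to_numpy(dtype=float)[X_period - 1:X_period - 1 + n]

    # 1 if close > open, 2 if close < open, 0 if they are equal.
    y = np.where(close > open_, 1, np.where(close < open_, 2, 0))
    return X, y
```

It would be called as X, y = build_xy(data). The fancy-indexing step materializes every window at once, so memory use grows with len(data) * X_period * columns; for very large files, processing the data in chunks may be needed.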