How to load csv data containing dates into a numpy array?

Question

I tried different things for too long now.

How do I load csv data containing dates into a numpy array? This is what doesn't work. It creates a single line with everything that is supposed to be a line now being in a single cell.

import io
import numpy as np
import datetime as dt


def date_parser(d_bytes):
    s = d_bytes.decode('utf-8')
    return np.datetime64(dt.datetime.strptime(s, "%Y-%m-%d %H:%M:%S"))


def read_csv():
    five_min_candles_str = """2020-06-01 17:05:00,9506.01,9523.31,9500.0,9514.52
2020-06-01 17:10:00,9513.44,9525.22,9500.32,9522.0
2020-06-01 17:15:00,9521.56,9525.59,9513.75,9523.53
2020-06-01 17:20:00,9523.21,9525.53,9518.78,9524.55
2020-06-01 17:25:00,9524.55,9538.4,9522.93,9528.73
2020-06-01 17:30:00,9528.73,9548.98,9527.95,9543.72
2020-06-01 17:35:00,9542.71,9547.34,9536.57,9543.66
2020-06-01 17:40:00,9543.67,9543.67,9530.0,9531.85
2020-06-01 17:45:00,9530.84,9535.01,9524.1,9526.75
2020-06-01 17:50:00,9526.47,9538.64,9521.87,9534.57
2020-06-01 17:55:00,9534.58,9548.9,9533.04,9546.98
2020-06-01 18:00:00,9548.18,9558.9,9536.99,9556.25
2020-06-01 18:05:00,9556.15,9579.8,9547.7,9574.09
2020-06-01 18:10:00,9575.0,9592.59,9571.3,9573.93
2020-06-01 18:15:00,9573.68,9610.0,9569.6,9597.78
2020-06-01 18:20:00,9597.78,9598.85,9578.0,9591.39
                """
    nparray = np.genfromtxt(io.StringIO(five_min_candles_str),
                            delimiter=',',
                            dtype=[('Timestamp','datetime64[us]'),
                                   ('Open','object'),
                                   ('High','object'),
                                   ('Low','object'),
                                   ('Close','object')],
                            converters={0: date_parser},
                            )
    print(nparray)


if __name__ == "__main__":
    read_csv()

A solution or hint would be much appreciated!

Edit: It turned out that it was indeed working already but I expected an 2D array while it became an array of tuples after I've added the types or the converter. The reason for that are the different types in a row. See the other SO question

I marked the answer below as correct anyways as I like it more because it doesn't need any custom parsing of the date and I also like the splitlines() solution more compared to io.StringIO()

Because I would like to iterate over the rows which I learned is an anti pattern with pandas. — Sandro
– Sandro, Commented Jun 2, 2020 at 20:45
row iteration is discourage if you can find vectorized function, but it's there for you to use at any time. And in this case, pandas helps you simplify your intake of csv file a lot. — Quang Hoang
– Quang Hoang, Commented Jun 2, 2020 at 20:46
The data is not so diverse so the reading would be pretty much done. If I can find a solution here numpy would still be the preferred way to go I guess. But I don't have much experience in both of them right now to really judge that. — Sandro
– Sandro, Commented Jun 2, 2020 at 20:59
Row iteration in pandas is no worse than in numpy. We see plenty of SO using pandas apply. For speed it is nice to operate on the whole dataframe/array, but some tasks might be too complicated to do that. — hpaulj
– hpaulj, Commented Jun 2, 2020 at 21:51

hpaulj · Accepted Answer · 2020-06-03 02:31:45Z

In [53]: five_min_candles_str = """2020-06-01 17:05:00,9506.01,9523.31,9500.0,95
    ...: 14.52 
    ...: 2020-06-01 17:10:00,9513.44,9525.22,9500.32,9522.0 
    ...: 2020-06-01 17:15:00,9521.56,9525.59,9513.75,9523.53 
    ...: 2020-06-01 17:20:00,9523.21,9525.53,9518.78,9524.55 
    ...: 2020-06-01 17:25:00,9524.55,9538.4,9522.93,9528.73 
    ...: 2020-06-01 17:30:00,9528.73,9548.98,9527.95,9543.72 
    ...: 2020-06-01 17:35:00,9542.71,9547.34,9536.57,9543.66 
    ...: 2020-06-01 17:40:00,9543.67,9543.67,9530.0,9531.85 
    ...: 2020-06-01 17:45:00,9530.84,9535.01,9524.1,9526.75 
    ...: 2020-06-01 17:50:00,9526.47,9538.64,9521.87,9534.57 
    ...: 2020-06-01 17:55:00,9534.58,9548.9,9533.04,9546.98 
    ...: 2020-06-01 18:00:00,9548.18,9558.9,9536.99,9556.25 
    ...: 2020-06-01 18:05:00,9556.15,9579.8,9547.7,9574.09 
    ...: 2020-06-01 18:10:00,9575.0,9592.59,9571.3,9573.93 
    ...: 2020-06-01 18:15:00,9573.68,9610.0,9569.6,9597.78 
    ...: 2020-06-01 18:20:00,9597.78,9598.85,9578.0,9591.39 
    ...: """

Let's see how numpy handles these date strings. It's not as robust as pandas, but:

In [55]: np.array('2020-06-01 17:05:00', 'datetime64[s]')                       
Out[55]: array('2020-06-01T17:05:00', dtype='datetime64[s]')

but looks ok. The space between date and time is ok ('T' also works).

So lets try a fully automatic dtype:

In [56]: data=np.genfromtxt(five_min_candles_str.splitlines(), delimiter=',', dt
    ...: ype=None, encoding=True)                                               
In [57]: data                                                                   
Out[57]: 
array([('2020-06-01 17:05:00', 9506.01, 9523.31, 9500.  , 9514.52),
       ('2020-06-01 17:10:00', 9513.44, 9525.22, 9500.32, 9522.  ),
       ('2020-06-01 17:15:00', 9521.56, 9525.59, 9513.75, 9523.53),
        ...
       ('2020-06-01 18:20:00', 9597.78, 9598.85, 9578.  , 9591.39)],
      dtype=[('f0', '<U19'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])

SO we need to specify the datetime dtype (editing that dtype):

In [58]: dt = [('f0', 'datetime64[s]'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')] 

In [59]: data=np.genfromtxt(five_min_candles_str.splitlines(), delimiter=',', dtype=dt, encoding=True)                                                 
In [60]: data                                                                   
Out[60]: 
array([('2020-06-01T17:05:00', 9506.01, 9523.31, 9500.  , 9514.52),
       ('2020-06-01T17:10:00', 9513.44, 9525.22, 9500.32, 9522.  ),
       ('2020-06-01T17:15:00', 9521.56, 9525.59, 9513.75, 9523.53),
       ('2020-06-01T17:20:00', 9523.21, 9525.53, 9518.78, 9524.55),
      ...
       ('2020-06-01T18:20:00', 9597.78, 9598.85, 9578.  , 9591.39)],
      dtype=[('f0', '<M8[s]'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])

Ok, got it. The data was looking wrong in the IDE but it was indeed ok. I still like your answer more as there is no parsing needed.

FObersteiner · Accepted Answer · 2020-06-03 05:42:00Z

1

for completeness, this is pretty much a one-liner using pandas:

from  io import StringIO
import pandas as pd

s="""2020-06-01 17:05:00,9506.01,9523.31,9500.0,9514.52
2020-06-01 17:10:00,9513.44,9525.22,9500.32,9522.0
2020-06-01 17:15:00,9521.56,9525.59,9513.75,9523.53
2020-06-01 17:20:00,9523.21,9525.53,9518.78,9524.55
2020-06-01 17:25:00,9524.55,9538.4,9522.93,9528.73
2020-06-01 17:30:00,9528.73,9548.98,9527.95,9543.72
2020-06-01 17:35:00,9542.71,9547.34,9536.57,9543.66
2020-06-01 17:40:00,9543.67,9543.67,9530.0,9531.85
2020-06-01 17:45:00,9530.84,9535.01,9524.1,9526.75
2020-06-01 17:50:00,9526.47,9538.64,9521.87,9534.57
2020-06-01 17:55:00,9534.58,9548.9,9533.04,9546.98
2020-06-01 18:00:00,9548.18,9558.9,9536.99,9556.25
2020-06-01 18:05:00,9556.15,9579.8,9547.7,9574.09
2020-06-01 18:10:00,9575.0,9592.59,9571.3,9573.93
2020-06-01 18:15:00,9573.68,9610.0,9569.6,9597.78
2020-06-01 18:20:00,9597.78,9598.85,9578.0,9591.39"""

df = pd.read_csv(StringIO(s), names=['Timestamp','Open','High','Low','Close'],
                 parse_dates=[0])

for c in df.columns:
    print(f"{c} - {df[c].dtype}")
# Timestamp - datetime64[ns]
# Open - float64
# High - float64
# Low - float64
# Close - float64

answered Jun 3, 2020 at 5:42

FObersteiner

26.1k9 gold badges60 silver badges94 bronze badges

1 Comment

Sandro Over a year ago

Yea, that worked pretty much out of the box for me as well. parse_dates=True is enough btw.

Collectives™ on Stack Overflow

How to load csv data containing dates into a numpy array?

2 Answers 2

1 Comment

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related