2

I tried different things for too long now.

How do I load csv data containing dates into a numpy array? This is what doesn't work. It creates a single line with everything that is supposed to be a line now being in a single cell.

import io
import numpy as np
import datetime as dt


def date_parser(d_bytes):
    s = d_bytes.decode('utf-8')
    return np.datetime64(dt.datetime.strptime(s, "%Y-%m-%d %H:%M:%S"))


def read_csv():
    five_min_candles_str = """2020-06-01 17:05:00,9506.01,9523.31,9500.0,9514.52
2020-06-01 17:10:00,9513.44,9525.22,9500.32,9522.0
2020-06-01 17:15:00,9521.56,9525.59,9513.75,9523.53
2020-06-01 17:20:00,9523.21,9525.53,9518.78,9524.55
2020-06-01 17:25:00,9524.55,9538.4,9522.93,9528.73
2020-06-01 17:30:00,9528.73,9548.98,9527.95,9543.72
2020-06-01 17:35:00,9542.71,9547.34,9536.57,9543.66
2020-06-01 17:40:00,9543.67,9543.67,9530.0,9531.85
2020-06-01 17:45:00,9530.84,9535.01,9524.1,9526.75
2020-06-01 17:50:00,9526.47,9538.64,9521.87,9534.57
2020-06-01 17:55:00,9534.58,9548.9,9533.04,9546.98
2020-06-01 18:00:00,9548.18,9558.9,9536.99,9556.25
2020-06-01 18:05:00,9556.15,9579.8,9547.7,9574.09
2020-06-01 18:10:00,9575.0,9592.59,9571.3,9573.93
2020-06-01 18:15:00,9573.68,9610.0,9569.6,9597.78
2020-06-01 18:20:00,9597.78,9598.85,9578.0,9591.39
                """
    nparray = np.genfromtxt(io.StringIO(five_min_candles_str),
                            delimiter=',',
                            dtype=[('Timestamp','datetime64[us]'),
                                   ('Open','object'),
                                   ('High','object'),
                                   ('Low','object'),
                                   ('Close','object')],
                            converters={0: date_parser},
                            )
    print(nparray)


if __name__ == "__main__":
    read_csv()

A solution or hint would be much appreciated!

Edit: It turned out that it was indeed working already but I expected an 2D array while it became an array of tuples after I've added the types or the converter. The reason for that are the different types in a row. See the other SO question

I marked the answer below as correct anyways as I like it more because it doesn't need any custom parsing of the date and I also like the splitlines() solution more compared to io.StringIO()

10
  • Why don't you use pandas? Commented Jun 2, 2020 at 20:43
  • Because I would like to iterate over the rows which I learned is an anti pattern with pandas. Commented Jun 2, 2020 at 20:45
  • row iteration is discourage if you can find vectorized function, but it's there for you to use at any time. And in this case, pandas helps you simplify your intake of csv file a lot. Commented Jun 2, 2020 at 20:46
  • The data is not so diverse so the reading would be pretty much done. If I can find a solution here numpy would still be the preferred way to go I guess. But I don't have much experience in both of them right now to really judge that. Commented Jun 2, 2020 at 20:59
  • 1
    Row iteration in pandas is no worse than in numpy. We see plenty of SO using pandas apply. For speed it is nice to operate on the whole dataframe/array, but some tasks might be too complicated to do that. Commented Jun 2, 2020 at 21:51

2 Answers 2

3
In [53]: five_min_candles_str = """2020-06-01 17:05:00,9506.01,9523.31,9500.0,95
    ...: 14.52 
    ...: 2020-06-01 17:10:00,9513.44,9525.22,9500.32,9522.0 
    ...: 2020-06-01 17:15:00,9521.56,9525.59,9513.75,9523.53 
    ...: 2020-06-01 17:20:00,9523.21,9525.53,9518.78,9524.55 
    ...: 2020-06-01 17:25:00,9524.55,9538.4,9522.93,9528.73 
    ...: 2020-06-01 17:30:00,9528.73,9548.98,9527.95,9543.72 
    ...: 2020-06-01 17:35:00,9542.71,9547.34,9536.57,9543.66 
    ...: 2020-06-01 17:40:00,9543.67,9543.67,9530.0,9531.85 
    ...: 2020-06-01 17:45:00,9530.84,9535.01,9524.1,9526.75 
    ...: 2020-06-01 17:50:00,9526.47,9538.64,9521.87,9534.57 
    ...: 2020-06-01 17:55:00,9534.58,9548.9,9533.04,9546.98 
    ...: 2020-06-01 18:00:00,9548.18,9558.9,9536.99,9556.25 
    ...: 2020-06-01 18:05:00,9556.15,9579.8,9547.7,9574.09 
    ...: 2020-06-01 18:10:00,9575.0,9592.59,9571.3,9573.93 
    ...: 2020-06-01 18:15:00,9573.68,9610.0,9569.6,9597.78 
    ...: 2020-06-01 18:20:00,9597.78,9598.85,9578.0,9591.39 
    ...: """                                                                    

Let's see how numpy handles these date strings. It's not as robust as pandas, but:

In [55]: np.array('2020-06-01 17:05:00', 'datetime64[s]')                       
Out[55]: array('2020-06-01T17:05:00', dtype='datetime64[s]')

but looks ok. The space between date and time is ok ('T' also works).

So lets try a fully automatic dtype:

In [56]: data=np.genfromtxt(five_min_candles_str.splitlines(), delimiter=',', dt
    ...: ype=None, encoding=True)                                               
In [57]: data                                                                   
Out[57]: 
array([('2020-06-01 17:05:00', 9506.01, 9523.31, 9500.  , 9514.52),
       ('2020-06-01 17:10:00', 9513.44, 9525.22, 9500.32, 9522.  ),
       ('2020-06-01 17:15:00', 9521.56, 9525.59, 9513.75, 9523.53),
        ...
       ('2020-06-01 18:20:00', 9597.78, 9598.85, 9578.  , 9591.39)],
      dtype=[('f0', '<U19'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])

SO we need to specify the datetime dtype (editing that dtype):

In [58]: dt = [('f0', 'datetime64[s]'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')] 

In [59]: data=np.genfromtxt(five_min_candles_str.splitlines(), delimiter=',', dtype=dt, encoding=True)                                                 
In [60]: data                                                                   
Out[60]: 
array([('2020-06-01T17:05:00', 9506.01, 9523.31, 9500.  , 9514.52),
       ('2020-06-01T17:10:00', 9513.44, 9525.22, 9500.32, 9522.  ),
       ('2020-06-01T17:15:00', 9521.56, 9525.59, 9513.75, 9523.53),
       ('2020-06-01T17:20:00', 9523.21, 9525.53, 9518.78, 9524.55),
      ...
       ('2020-06-01T18:20:00', 9597.78, 9598.85, 9578.  , 9591.39)],
      dtype=[('f0', '<M8[s]'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])
Sign up to request clarification or add additional context in comments.

1 Comment

Ok, got it. The data was looking wrong in the IDE but it was indeed ok. I still like your answer more as there is no parsing needed.
1

for completeness, this is pretty much a one-liner using pandas:

from  io import StringIO
import pandas as pd

s="""2020-06-01 17:05:00,9506.01,9523.31,9500.0,9514.52
2020-06-01 17:10:00,9513.44,9525.22,9500.32,9522.0
2020-06-01 17:15:00,9521.56,9525.59,9513.75,9523.53
2020-06-01 17:20:00,9523.21,9525.53,9518.78,9524.55
2020-06-01 17:25:00,9524.55,9538.4,9522.93,9528.73
2020-06-01 17:30:00,9528.73,9548.98,9527.95,9543.72
2020-06-01 17:35:00,9542.71,9547.34,9536.57,9543.66
2020-06-01 17:40:00,9543.67,9543.67,9530.0,9531.85
2020-06-01 17:45:00,9530.84,9535.01,9524.1,9526.75
2020-06-01 17:50:00,9526.47,9538.64,9521.87,9534.57
2020-06-01 17:55:00,9534.58,9548.9,9533.04,9546.98
2020-06-01 18:00:00,9548.18,9558.9,9536.99,9556.25
2020-06-01 18:05:00,9556.15,9579.8,9547.7,9574.09
2020-06-01 18:10:00,9575.0,9592.59,9571.3,9573.93
2020-06-01 18:15:00,9573.68,9610.0,9569.6,9597.78
2020-06-01 18:20:00,9597.78,9598.85,9578.0,9591.39"""

df = pd.read_csv(StringIO(s), names=['Timestamp','Open','High','Low','Close'],
                 parse_dates=[0])

for c in df.columns:
    print(f"{c} - {df[c].dtype}")
# Timestamp - datetime64[ns]
# Open - float64
# High - float64
# Low - float64
# Close - float64

1 Comment

Yea, that worked pretty much out of the box for me as well. parse_dates=True is enough btw.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.