
The CSV file may not be clean (some lines have an inconsistent number of elements); such lines need to be discarded. String manipulation is required during processing.

Example input:

20150701 20:00:15.173,0.5019,0.91665

Desired output: float32 (pseudo-date, seconds in the day, f3, f4)

0.150701 72015.173 0.5019 0.91665 (plus the trailing garbage digits floats usually pick up)

The CSV file is also very big: over 30 GB on disk, with the resulting numpy array expected to take 5-10 GB in memory.

Looking for an efficient way to process the CSV file and end up with a numpy array.

Current solution: use the csv module, process the file line by line, and use a list as a buffer that later gets turned into a numpy array with asarray(). The problem is that during the conversion memory consumption is doubled, and the copy adds execution overhead.
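A rough sketch of that current approach (the row parser is a hypothetical placeholder), showing where the duplication happens:

import csv
import numpy as np

buf = []
with open('myfile.csv') as f:
    for row in csv.reader(f):
        try:
            buf.append(parse_row(row))  # hypothetical str-to-floats helper
        except ValueError:
            pass  # discard unclean lines
# The copy: list and array coexist here, so peak memory roughly doubles.
a = np.asarray(buf, dtype=np.float32)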

NumPy's genfromtxt and loadtxt don't appear to be able to process the data as desired.

  • Can you post your code? Commented Jan 20, 2016 at 19:51
  • Do (Can) you know, in advance, how many rows the final array has? Commented Jan 20, 2016 at 20:01
  • Take a look at pandas.read_csv (pandas.pydata.org/pandas-docs/version/0.17.1/generated/…). There are a lot of tools for date formats in particular. Commented Jan 20, 2016 at 20:12

3 Answers


If you know in advance how many rows are in the data, you could dispense with the intermediate list and write directly to the array.

import numpy as np

no_rows = 5
no_columns = 4

# Preallocate the full array up front; float32 matches the desired output
# (np.float has been removed from modern NumPy).
a = np.zeros((no_rows, no_columns), dtype=np.float32)

with open('myfile') as f:
    for i, line in enumerate(f):
        # Each parsed row goes straight into its slot; no intermediate list.
        a[i, :] = cool_function_that_returns_formatted_data(line)
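The answer leaves the parser as a placeholder. A minimal sketch of what it could look like for the format in the question (the name parse_line and the exact splitting are assumptions, not part of the original answer):

def parse_line(line):
    # "20150701 20:00:15.173,0.5019,0.91665"
    #   -> (0.150701, 72015.173, 0.5019, 0.91665)
    date_time, f3, f4 = line.split(',')  # unclean lines raise ValueError
    date, time = date_time.split(' ')
    h, m, s = time.split(':')
    pseudo_date = float(date[2:]) / 1e6  # "20150701" -> 0.150701
    seconds = int(h) * 3600 + int(m) * 60 + float(s)
    return pseudo_date, seconds, float(f3), float(f4)

Wrapping the assignment in the loop in try/except ValueError (and advancing a manual row counter only on success) would let unclean lines be discarded without breaking the fill.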

2 Comments

Counting lines is trivial unless it's a tape drive. I was hoping there was some way in Python to use a buffer you can expand and then treat as a numpy array, but failing that, your solution is what comes to mind, and provided the number of bad lines is not large, it is very efficient.
@dingrite - you have to process the string data before putting it into an array, and I understood that there were memory constraints that couldn't handle two copies of the data. So it seems like you need to process the data in chunks and add to an existing array (a sketch of that follows below). Did you try a pandas solution as suggested by others? Maybe there is a way with something in the io module or with mmap, which I really don't quite understand.
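A minimal sketch of that two-pass approach, reusing the hypothetical parse_line from above: count the rows first, preallocate, then fill while discarding unclean lines.

import numpy as np

def load_csv(path):
    # First pass: count lines so the array can be preallocated.
    with open(path) as f:
        n_rows = sum(1 for _ in f)
    a = np.empty((n_rows, 4), dtype=np.float32)
    good = 0
    # Second pass: fill row by row, skipping lines that fail to parse.
    with open(path) as f:
        for line in f:
            try:
                a[good] = parse_line(line)
                good += 1
            except ValueError:
                pass
    return a[:good]  # a view, not a copy; trims rows lost to bad lines

Peak memory stays at a single copy of the array, since the trailing slice is a view.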

Did you consider using pandas read_csv (with engine='c')?

I find it one of the best and easiest ways to handle CSV. I have worked with a 4 GB file and it worked for me.

import pandas as pd

df = pd.read_csv('abc.csv', engine='c')  # 'c' selects the fast C parser
print(df.head(10))
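For a file this size, read_csv's chunksize parameter keeps the raw text out of memory. A hedged sketch for the format in the question (file name, column names, and chunk size are assumptions; on_bad_lines needs pandas >= 1.3, older versions spelled it error_bad_lines=False):

import numpy as np
import pandas as pd

chunks = pd.read_csv('myfile.csv', engine='c', header=None,
                     names=['ts', 'f3', 'f4'], on_bad_lines='skip',
                     chunksize=1_000_000)
parts = []
for chunk in chunks:
    ts = pd.to_datetime(chunk['ts'], format='%Y%m%d %H:%M:%S.%f')
    out = np.empty((len(chunk), 4), dtype=np.float32)
    # 20150701 -> 0.150701, matching the pseudo-date in the question
    out[:, 0] = (ts.dt.year % 100) / 100 + ts.dt.month / 1e4 + ts.dt.day / 1e6
    out[:, 1] = (ts - ts.dt.normalize()).dt.total_seconds()  # seconds in the day
    out[:, 2] = chunk['f3'].to_numpy(np.float32)
    out[:, 3] = chunk['f4'].to_numpy(np.float32)
    parts.append(out)
a = np.concatenate(parts)

Note that np.concatenate still makes one final copy while the chunk list is alive; if the row count is known up front, each chunk can instead be written into a preallocated array, as in the accepted answer.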



I think the I/O capability of pandas is the best way to get data into a numpy array. Specifically, the read_csv function will read into a pandas DataFrame. You can then access the underlying numpy array via the DataFrame's to_numpy method (called as_matrix in the pandas versions of the time; it has since been removed).
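A minimal sketch of that round trip (file name assumed; df.values works on older pandas where to_numpy is unavailable):

import pandas as pd

df = pd.read_csv('myfile.csv', header=None)  # parse the CSV into a DataFrame
a = df.to_numpy()  # the underlying numpy array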

1 Comment

Can you give an example of that? The documentation doesn't show anything other than examples where out=None. I think it's supposed to look something like np.divide(train_inputs, 255, out=train_inputs), with no return value. But I get an error: TypeError: No loop matching the specified signature and casting
