
The CSV file may not be clean (some lines have an inconsistent number of elements); such lines need to be discarded. String manipulation is required during processing.

Example input:

20150701 20:00:15.173,0.5019,0.91665

Desired output: float32 (pseudo-date, seconds in the day, f3, f4)

0.150701 72015.173 0.5019 0.91665 (plus the trailing garbage digits floats usually pick up)

The CSV file is also very big: over 30 GB on disk, with the resulting numpy array expected to take 5-10 GB in memory.

Looking for an efficient way to process the CSV file and end up with a numpy array.

Current solution: use the csv module, process the file line by line, and use a list as a buffer that later gets turned into a numpy array with asarray(). The problem is that during the conversion memory consumption is doubled, and the copy adds execution overhead.
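A rough sketch of that current approach (the row parser is a hypothetical placeholder), showing where the duplication happens:

import csv
import numpy as np

buf = []
with open('myfile.csv') as f:
    for row in csv.reader(f):
        try:
            buf.append(parse_row(row))  # hypothetical str-to-floats helper
        except ValueError:
            pass  # discard unclean lines
# The copy: list and array coexist here, so peak memory roughly doubles.
a = np.asarray(buf, dtype=np.float32)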

NumPy's genfromtxt and loadtxt don't appear to be able to process the data as desired.

  • Can you post your code? Commented Jan 20, 2016 at 19:51
  • Do (Can) you know, in advance, how many rows the final array has? Commented Jan 20, 2016 at 20:01
  • Take a look at pandas.read_csv (pandas.pydata.org/pandas-docs/version/0.17.1/generated/…). There are a lot of tools for date formats in particular. Commented Jan 20, 2016 at 20:12

3 Answers


If you know in advance how many rows are in the data, you could dispense with the intermediate list and write directly to the array.

import numpy as np

no_rows = 5
no_columns = 4

# Preallocate the full array up front; float32 matches the desired output
# (np.float has been removed from modern NumPy).
a = np.zeros((no_rows, no_columns), dtype=np.float32)

with open('myfile') as f:
    for i, line in enumerate(f):
        # Each parsed row goes straight into its slot; no intermediate list.
        a[i, :] = cool_function_that_returns_formatted_data(line)
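The answer leaves the parser as a placeholder. A minimal sketch of what it could look like for the format in the question (the name parse_line and the exact splitting are assumptions, not part of the original answer):

def parse_line(line):
    # "20150701 20:00:15.173,0.5019,0.91665"
    #   -> (0.150701, 72015.173, 0.5019, 0.91665)
    date_time, f3, f4 = line.split(',')  # unclean lines raise ValueError
    date, time = date_time.split(' ')
    h, m, s = time.split(':')
    pseudo_date = float(date[2:]) / 1e6  # "20150701" -> 0.150701
    seconds = int(h) * 3600 + int(m) * 60 + float(s)
    return pseudo_date, seconds, float(f3), float(f4)

Wrapping the assignment in the loop in try/except ValueError (and advancing a manual row counter only on success) would let unclean lines be discarded without breaking the fill.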

2 Comments

Counting lines is trivial unless it's a tape drive. I was hoping there was some way in Python to use a buffer you can expand and then treat as a numpy array, but failing that, your solution is what comes to mind, and provided the number of bad lines is not large, it is very efficient.
@dingrite - you have to process the string data before putting it into an array, and I understood that there were memory constraints that couldn't handle two copies of the data. So it seems like you need to process the data in chunks and add to an existing array (a sketch of that follows below). Did you try a pandas solution as suggested by others? Maybe there is a way with something in the io module or with mmap, which I really don't quite understand.
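A minimal sketch of that two-pass approach, reusing the hypothetical parse_line from above: count the rows first, preallocate, then fill while discarding unclean lines.

import numpy as np

def load_csv(path):
    # First pass: count lines so the array can be preallocated.
    with open(path) as f:
        n_rows = sum(1 for _ in f)
    a = np.empty((n_rows, 4), dtype=np.float32)
    good = 0
    # Second pass: fill row by row, skipping lines that fail to parse.
    with open(path) as f:
        for line in f:
            try:
                a[good] = parse_line(line)
                good += 1
            except ValueError:
                pass
    return a[:good]  # a view, not a copy; trims rows lost to bad lines

Peak memory stays at a single copy of the array, since the trailing slice is a view.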

Did you consider using pandas read_csv (with engine='c')?

I find it one of the best and easiest ways to handle CSV. I have worked with a 4 GB file and it worked for me.

import pandas as pd

df = pd.read_csv('abc.csv', engine='c')  # 'c' selects the fast C parser
print(df.head(10))
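For a file this size, read_csv's chunksize parameter keeps the raw text out of memory. A hedged sketch for the format in the question (file name, column names, and chunk size are assumptions; on_bad_lines needs pandas >= 1.3, older versions spelled it error_bad_lines=False):

import numpy as np
import pandas as pd

chunks = pd.read_csv('myfile.csv', engine='c', header=None,
                     names=['ts', 'f3', 'f4'], on_bad_lines='skip',
                     chunksize=1_000_000)
parts = []
for chunk in chunks:
    ts = pd.to_datetime(chunk['ts'], format='%Y%m%d %H:%M:%S.%f')
    out = np.empty((len(chunk), 4), dtype=np.float32)
    # 20150701 -> 0.150701, matching the pseudo-date in the question
    out[:, 0] = (ts.dt.year % 100) / 100 + ts.dt.month / 1e4 + ts.dt.day / 1e6
    out[:, 1] = (ts - ts.dt.normalize()).dt.total_seconds()  # seconds in the day
    out[:, 2] = chunk['f3'].to_numpy(np.float32)
    out[:, 3] = chunk['f4'].to_numpy(np.float32)
    parts.append(out)
a = np.concatenate(parts)

Note that np.concatenate still makes one final copy while the chunk list is alive; if the row count is known up front, each chunk can instead be written into a preallocated array, as in the accepted answer.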



I think the I/O capability of pandas is the best way to get data into a numpy array. Specifically, the read_csv function will read into a pandas DataFrame. You can then access the underlying numpy array via the DataFrame's to_numpy method (called as_matrix in the pandas versions of the time; it has since been removed).
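A minimal sketch of that round trip (file name assumed; df.values works on older pandas where to_numpy is unavailable):

import pandas as pd

df = pd.read_csv('myfile.csv', header=None)  # parse the CSV into a DataFrame
a = df.to_numpy()  # the underlying numpy array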

1 Comment

Can you give an example of that? The documentation doesn't show anything other than examples where out=None. I think it's supposed to look something like np.divide(train_inputs, 255, out=train_inputs), with no return value. But I get an error: TypeError: No loop matching the specified signature and casting
