18

I have a large csv file (~10GB), with around 4000 columns. I know that most of data i will expect is int8, so i set:

pandas.read_csv('file.dat', sep=',', engine='c', header=None, 
                na_filter=False, dtype=np.int8, low_memory=False)

Thing is, the final column (4000th position) is int32, is there away can i tell read_csv that use int8 by default, and at column 4000th, use int 32?

Thank you

1
  • 2
    One hack I can think of: Read all the columns as int8. Then read only the 4000th column as int32 using usecols. Then replace it in the first dataframe. Commented Jun 1, 2018 at 11:58

2 Answers 2

13

If you are certain of the number you could recreate the dictionary like this:

dtype = dict(zip(range(4000),['int8' for _ in range(3999)] + ['int32']))

Considering that this works:

import pandas as pd
import numpy as np
​
data = '''\
1,2,3
4,5,6'''
​
fileobj = pd.compat.StringIO(data)
df = pd.read_csv(fileobj, dtype={0:'int8',1:'int8',2:'int32'}, header=None)
​
print(df.dtypes)

Returns:

0     int8
1     int8
2    int32
dtype: object

From the docs:

dtype : Type name or dict of column -> type, default None

Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use str or object to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

Sign up to request clarification or add additional context in comments.

Comments

4

Since you have no header, the column names are the integer order in which they occur, i.e. the first column is df[0]. To programmatically set the last column to be int32, you can read the first line of the file to get the width of the dataframe, then construct a dictionary of the integer types you want to use with the number of the columns as the keys.

import numpy as np
import pandas as pd

with open('file.dat') as fp:
    width = len(fp.readline().strip().split(','))
    dtypes = {i: np.int8 for i in range(width)}
    # update the last column's dtype
    dtypes[width-1] = np.int32

    # reset the read position of the file pointer
    fp.seek(0)
    df = pd.read_csv(fp, sep=',', engine='c', header=None, 
                     na_filter=False, dtype=dtypes, low_memory=False)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.