I have a large CSV file with over 200 columns. Some of the columns are strings, some varchar, some integers and some floats.
When I just read the CSV file into a pandas DataFrame, it detects which columns are numerical. However, it gives me the DtypeWarning telling me to specify a dtype on import or set low_memory=False.
import numpy as np
import pandas as pd

df = pd.read_csv('myfile.csv')
# columns that were not read as numeric
df_not_num = df.select_dtypes(exclude=[np.number, np.int16, np.bool_, np.float32])

print(len(list(df)))
>>> 200
print(len(list(df_not_num)))
>>> 10
Then I tried specifying a dtype: dtype='unicode'.
But this causes all of my columns to be read as object.
It is too much manual work to specify a dtype per column name when reading the CSV into a DataFrame.
df = pd.read_csv('myfile.csv', dtype='unicode')
df_not_num = df.select_dtypes(exclude=[np.number, np.int16, np.bool_, np.float32])

print(len(list(df)))
>>> 200
print(len(list(df_not_num)))
>>> 200
So the only way to avoid the low-memory warning is to specify a dtype. But how do I specify that I have mixed dtypes across columns without having to manually specify the dtype of each of the 200 columns?
Per the read_csv documentation, the dtype parameter only accepts a dict that maps particular column names to types, e.g. {'a': np.float64, 'b': np.int32}, or a single dtype that is applied to every column, or nothing at all (letting pandas infer). Also, there is no "varchar" type in Python; those columns will simply come through as object (string) columns.
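A minimal sketch of the two workable options, reusing the 'myfile.csv' name from the question (the column names 'a' and 'b' are placeholders): pass a dict for only the handful of columns pandas infers incorrectly, or set low_memory=False so pandas reads the whole file before deciding each column's dtype, which is what the warning itself suggests. Note that with low_memory=False a genuinely mixed column still ends up as object; it just avoids the chunked inference that triggers the warning.

import numpy as np
import pandas as pd

# Option 1: override only the problematic columns ('a' and 'b' are placeholder
# names); every other column is still inferred automatically.
df = pd.read_csv('myfile.csv', dtype={'a': np.float64, 'b': np.int32})

# Option 2: read the file in a single pass so pandas sees all of a column's
# values before inferring its dtype; this avoids the DtypeWarning.
df = pd.read_csv('myfile.csv', low_memory=False)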