5

I have a CSV file with several columns that include integers and a string. Naturally, I get a dtype warning because of the mixed dtypes. I read the file with this general command.

df = pd.read_csv(path, sep=";", na_values=missing)

I could use low_memory=False or dtype=object to silence the warning but as far as I know this makes reading my file not more memory efficient.

I could also use na_values="my_string" but I have other missing values (which are supposed to be real missing values) and do not want to mix them.

I do not need the value of the string but only its value count so I thought of replacing it with an integer. Something like this.

df.replace(to_replace="my_string", value=999)

However, is it also possible to replace a value while reading a CSV file? Or does another solution exist? I do not want to simply silence the warning but find a solution which is more memory efficient.

(I know about this answer but it does not really help me with my problem.)

2
  • do you know names of "problematic" columns upfront? Commented Feb 16, 2018 at 12:59
  • @MaxU Yes, I know this. Commented Feb 16, 2018 at 13:04

2 Answers 2

6

You can use converters:

In [156]: def conv(val, default_val=999):
     ...:     try:
     ...:         return int(val)
     ...:     except ValueError:
     ...:         return default_val
     ...:

In [157]: conv('a')
Out[157]: 999

In [158]: pd.read_csv(r'C:\Temp\test.csv', converters={'a':conv})
Out[158]:
     a   b           c
0    1  11  2000-01-01
1  999  12  2000-01-02
2    3  13  2000-01-02

Another approach, would be to convert columns in a vectorized way after parsing a CSV file:

In [166]: df = pd.read_csv(r'C:\Temp\test.csv', parse_dates=['c'])

In [167]: df
Out[167]:
     a    b          c
0    1  AAA 2000-01-01
1  XXX   12 2000-01-02
2    3   13 2000-01-02

In [168]: df.dtypes
Out[168]:
a            object
b            object
c    datetime64[ns]
dtype: object

In [169]: int_cols = ['a','b']

In [170]: df[int_cols] = df[int_cols].apply(pd.to_numeric, errors='coerce').fillna(999).astype(int)

In [171]: df
Out[171]:
     a    b          c
0    1  999 2000-01-01
1  999   12 2000-01-02
2    3   13 2000-01-02

In [172]: df.dtypes
Out[172]:
a             int32
b             int32
c    datetime64[ns]
dtype: object

Speed comparison for 300.000 rows DF:

In [175]: df = pd.concat([df] * 10**5, ignore_index=True)

In [176]: df.shape
Out[176]: (300000, 3)

In [177]: filename = r'C:\Temp\test.csv'

In [184]: df.to_csv(filename, index=False)

In [185]: %%timeit
     ...: df = pd.read_csv(filename, parse_dates=['c'], converters={'a':conv, 'b':conv})
     ...:
632 ms ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [186]: %%timeit
     ...: df = pd.read_csv(filename, parse_dates=['c'])
     ...: df[int_cols] = df[int_cols].apply(pd.to_numeric, errors='coerce').fillna(999).astype(int)
     ...:
706 ms ± 60.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sign up to request clarification or add additional context in comments.

1 Comment

@NK_, glad i could help :)
1

It's no possible to replace de values while you are reading a CSV file. You have to replace once you load the data and save it. Then you don't get warning anymore.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.