As with pandas, when dask reads a text file it can't rely on embedded metadata specifying column types, the way it can when reading parquet files or other binary formats. But in dask's case, the entire file can't be parsed just to determine the types - that would defeat the purpose. So dask is left with two options: either you can tell it what the data types are, or it can take a peek at the data and assume the rest of the file has the same structure.
From the dask.dataframe.read_csv docs:
Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was a NaN, then this would error at compute time. To fix this, you have a few options:
Provide explicit dtypes for the offending columns using the dtype keyword. This is the recommended solution.
Use the assume_missing keyword to assume that all columns inferred as integers contain missing values, and convert them to floats.
Increase the size of the sample using the sample keyword.
It should also be noted that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.
Here's a simple illustration of the problem. Let's say we have the following (malformed) CSV file:
In [8]: myfile = '''
...: day,new_confirmed
...: 0,0
...: 1,4
...: 2,9
...: 3,2
...: 4,12
...: day,new_confirmed
...: 6,18
...: 7,19
...: 8,13
...: 9,20
...: '''.strip()
In [9]: with open('myfile.csv', 'w') as f:
...: f.write(myfile)
...:
If we tried to read this file with pandas, it would encounter the stray header strings on row 6, cast the new_confirmed column to the object dtype, and you'd know immediately that you have some data cleaning to do.
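A quick sketch of that pandas behavior (recreating the same file so the snippet stands alone):

```python
import pandas as pd

# Recreate the malformed file from above
with open('myfile.csv', 'w') as f:
    f.write('day,new_confirmed\n0,0\n1,4\n2,9\n3,2\n4,12\n'
            'day,new_confirmed\n6,18\n7,19\n8,13\n9,20')

pdf = pd.read_csv('myfile.csv')

# The stray header row forces both columns to object dtype:
print(pdf.dtypes)
# day              object
# new_confirmed    object
# dtype: object
```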
With dask, the situation is tougher. We'll read this file in, sampling only the first 5 rows to infer types. The dask default is much higher, but this is a small example :)
In [10]: df = dask.dataframe.read_csv('myfile.csv', sample_rows=5)
In [11]: df
Out[11]:
Dask DataFrame Structure:
day new_confirmed
npartitions=1
int64 int64
... ...
Dask Name: read-csv, 1 tasks
In this case, you can see that dask thinks the column is integer-typed. This inference is even carried through into computations - when we take the mean, dask predicts that our buggy integer column will produce a float:
In [12]: df.new_confirmed.mean()
Out[12]: dd.Scalar<series-..., dtype=float64>
We only find out we have a problem later, when computing the result:
In [13]: df.new_confirmed.mean().compute()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 df.new_confirmed.mean().compute()
...
File ~/opt/miniconda3/envs/myenv/lib/python3.10/site-packages/dask/dataframe/io/csv.py:284, in coerce_dtypes(df, dtypes)
281 msg = "Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.\n\n%s" % (
282 rule.join(filter(None, [dtype_msg, date_msg]))
283 )
--> 284 raise ValueError(msg)
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+---------------+--------+----------+
| Column | Found | Expected |
+---------------+--------+----------+
| new_confirmed | object | int64 |
+---------------+--------+----------+
The following columns also raised exceptions on conversion:
- new_confirmed
ValueError("invalid literal for int() with base 10: 'new_confirmed'")
Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'new_confirmed': 'object'}
to the call to `read_csv`/`read_table`.
The best dask can do is offer a helpful error message, and it did. We can use it to understand that we have cleaning work to do!