
I am trying to create a Dask dataframe from a CSV file stored on HDFS. The CSV on HDFS consists of many part files.

On calling the read_csv API:

dd.read_csv("hdfs:<some path>/data.csv")

Following error occurs:

OSError: Could not open file: <some path>/data.csv, mode: rb Path is not a file: <some path>/data.csv

In fact, /data.csv is a directory containing many part files. I'm not sure if there is a different API to read such an HDFS CSV.

  • Can you ensure that your path-string looks like "hdfs:/some/path/data.csv/*.csv" (note the '/' after the colon and the glob pattern)? Commented Sep 29, 2017 at 0:49
  • Thanks mdurant, glob pattern worked :) Commented Sep 29, 2017 at 1:47
  • @mdurant: If I may ask in this thread itself, dask is not able to read parquet(on hdfs and does not have metadata) files saved by spark. Is there any fix for that. Commented Sep 29, 2017 at 1:49
  • Yes you can: explicitly pass the list of files, e.g., from running hdfs.glob('/path/parquet/*.parq'). Also, spark does have an option to write the metadata file. Commented Sep 29, 2017 at 2:42
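The explicit-list approach from the last comment can be sketched locally. This is a minimal stdlib-only simulation: Python's `glob` module stands in for `hdfs.glob`, and the `dd.read_parquet` call is left as a comment since it needs a live HDFS cluster (the directory layout and file names here are invented for illustration):

```python
import glob
import os
import tempfile

# Simulate a Spark-written parquet directory of part files.
d = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(d, f"part-{i:05d}.parq"), "wb").close()
open(os.path.join(d, "_SUCCESS"), "w").close()  # Spark marker file, not data

# Build an explicit file list, analogous to hdfs.glob('/path/parquet/*.parq').
files = sorted(glob.glob(os.path.join(d, "*.parq")))
print(len(files))  # 3 -- the _SUCCESS marker is excluded by the pattern

# On a real cluster you would then pass the list, not the directory:
# df = dd.read_parquet(files)
```

The point is that a pattern like `*.parq` naturally skips non-data files such as `_SUCCESS`, which is one reason an explicit glob works where a bare directory name does not.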

1 Answer


Dask does not know which files you intend to read when you pass only a directory name. You should pass a glob string used to search for files, or an explicit list of files, e.g.,

df = dd.read_csv("hdfs:///some/path/data.csv/*.csv")

Note the leading '/' after the colon: all hdfs paths begin this way.
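For intuition, the glob expansion works like Python's own `glob` module. A minimal local sketch (stdlib only, no HDFS; the part-file names are made up to mimic Hadoop output) showing how one pattern fans out to many part files that are read and concatenated:

```python
import csv
import glob
import os
import tempfile

# Create a directory of CSV part files, as Spark/Hadoop jobs produce.
d = tempfile.mkdtemp()
for i, rows in enumerate([[["a", 1], ["b", 2]], [["c", 3]]]):
    with open(os.path.join(d, f"part-{i:05d}.csv"), "w", newline="") as f:
        csv.writer(f).writerows(rows)

# One glob pattern expands to every part file -- conceptually what
# dd.read_csv("hdfs:///some/path/data.csv/*.csv") does against HDFS.
records = []
for path in sorted(glob.glob(os.path.join(d, "*.csv"))):
    with open(path, newline="") as f:
        records.extend(list(csv.reader(f)))

print(records)  # [['a', '1'], ['b', '2'], ['c', '3']]
```

Dask performs this expansion on the remote filesystem and reads each matched file as one or more partitions, which is why the pattern must resolve to files rather than a directory.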

