
I am trying to create a Dask dataframe from a CSV file stored on HDFS. The CSV on HDFS consists of many part files.

On calling the read_csv API:

dd.read_csv("hdfs:<some path>/data.csv")

Following error occurs:

OSError: Could not open file: <some path>/data.csv, mode: rb Path is not a file: <some path>/data.csv

In fact, /data.csv is a directory containing many part files. I'm not sure if there is a different API to read such an HDFS CSV.

  • Can you ensure that your path-string looks like "hdfs:/some/path/data.csv/*.csv" (note the '/' after the colon and the glob pattern)? Commented Sep 29, 2017 at 0:49
  • Thanks mdurant, glob pattern worked :) Commented Sep 29, 2017 at 1:47
  • @mdurant: If I may ask in this thread itself, dask is not able to read parquet(on hdfs and does not have metadata) files saved by spark. Is there any fix for that. Commented Sep 29, 2017 at 1:49
  • Yes you can: explicitly pass the list of files, e.g., from running hdfs.glob('/path/parquet/*.parq'). Also, spark does have an option to write the metadata file. Commented Sep 29, 2017 at 2:42
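The explicit-list approach from the last comment can be sketched locally. This is a minimal stdlib-only simulation: Python's `glob` module stands in for `hdfs.glob`, and the `dd.read_parquet` call is left as a comment since it needs a live HDFS cluster (the directory layout and file names here are invented for illustration):

```python
import glob
import os
import tempfile

# Simulate a Spark-written parquet directory of part files.
d = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(d, f"part-{i:05d}.parq"), "wb").close()
open(os.path.join(d, "_SUCCESS"), "w").close()  # Spark marker file, not data

# Build an explicit file list, analogous to hdfs.glob('/path/parquet/*.parq').
files = sorted(glob.glob(os.path.join(d, "*.parq")))
print(len(files))  # 3 -- the _SUCCESS marker is excluded by the pattern

# On a real cluster you would then pass the list, not the directory:
# df = dd.read_parquet(files)
```

The point is that a pattern like `*.parq` naturally skips non-data files such as `_SUCCESS`, which is one reason an explicit glob works where a bare directory name does not.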

1 Answer


Dask does not know which files you intend to read when you pass only a directory name. You should pass a glob string used to search for files, or an explicit list of files, e.g.,

df = dd.read_csv("hdfs:///some/path/data.csv/*.csv")

Note the leading '/' after the colon: all hdfs paths begin this way.
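For intuition, the glob expansion works like Python's own `glob` module. A minimal local sketch (stdlib only, no HDFS; the part-file names are made up to mimic Hadoop output) showing how one pattern fans out to many part files that are read and concatenated:

```python
import csv
import glob
import os
import tempfile

# Create a directory of CSV part files, as Spark/Hadoop jobs produce.
d = tempfile.mkdtemp()
for i, rows in enumerate([[["a", 1], ["b", 2]], [["c", 3]]]):
    with open(os.path.join(d, f"part-{i:05d}.csv"), "w", newline="") as f:
        csv.writer(f).writerows(rows)

# One glob pattern expands to every part file -- conceptually what
# dd.read_csv("hdfs:///some/path/data.csv/*.csv") does against HDFS.
records = []
for path in sorted(glob.glob(os.path.join(d, "*.csv"))):
    with open(path, newline="") as f:
        records.extend(list(csv.reader(f)))

print(records)  # [['a', '1'], ['b', '2'], ['c', '3']]
```

Dask performs this expansion on the remote filesystem and reads each matched file as one or more partitions, which is why the pattern must resolve to files rather than a directory.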

