
We are using the Spark CSV reader to read a CSV file into a DataFrame, and we are running the job in yarn-client mode; it works fine in local mode.

We are submitting the Spark job from an edge node.

But when we place the file on a local file path instead of HDFS, we get a FileNotFoundException.

Code:

sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true")
      .load("file:/filepath/file.csv")

We also tried file:///, but we still get the same error.

Error log:

2016-12-24 16:05:40,044 WARN  [task-result-getter-0] scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hklvadcnc06.hk.standardchartered.com): java.io.FileNotFoundException: File file:/shared/sample1.csv does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:241)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
  • Does that file exist at that location? Commented Dec 24, 2016 at 12:38
  • @mrsrinivas: yes, it's available; that's why the job works fine when I run it on the YARN cluster in local mode. It only fails in yarn-client mode. Commented Dec 24, 2016 at 12:49
  • In a normal case it should work as you have tried. However, if the intention is to make it work, then try SparkFiles; in your case something like this: import org.apache.spark.SparkFiles SparkContext.addFile("file:/filepath/file.csv") println(SparkFiles.getRootDirectory()) println(SparkFiles.get("file.csv")) sqlContext.read.format("com.databricks.spark.csv") .option("header", "true").option("inferSchema", "true") .load(SparkFiles.get("file.csv")) (a readable version of this snippet follows after these comments). Commented Dec 24, 2016 at 19:56
  • Also, please post all the versions and the spark-submit command as part of your question. Commented Dec 24, 2016 at 20:02
  • @Ram Ghadiyaram: thanks, I will try SparkFiles tomorrow and let you know. Commented Dec 25, 2016 at 13:50
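
For reference, here is the SparkFiles suggestion from the comment above in readable form. This is only a rough sketch: it assumes the same sqlContext as in the question and an existing SparkContext instance named sc (addFile is an instance method, so sc.addFile is used rather than the static-looking SparkContext.addFile from the comment).

import org.apache.spark.SparkFiles

// Ship the local file from the driver (edge node) to every executor.
sc.addFile("file:/filepath/file.csv")

// Directory where shipped files land, and the local path of the shipped copy.
println(SparkFiles.getRootDirectory())
println(SparkFiles.get("file.csv"))

// Read the shipped copy instead of the original edge-node path.
sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true")
      .load(SparkFiles.get("file.csv"))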

2 Answers


Yes, this will work fine in local mode, but it won't work when submitted from the edge node, because the file on the edge node's local filesystem is not accessible to the executors running on the cluster nodes. HDFS makes the file accessible to all nodes through its URL.
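
As an illustration, one common fix is to copy the file into HDFS first (for example with hdfs dfs -put /shared/sample1.csv /user/spark/sample1.csv; this HDFS path is only a placeholder) and then read it from the HDFS URL instead of a local path:

// Minimal sketch, assuming the file was already copied into HDFS at the path below.
sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true")
      .load("hdfs:///user/spark/sample1.csv")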


2 Comments

Does it mean we cannot read any files from a Linux file path, and only an HDFS location should be used to read files?
To be honest, I never tried this. What actually happens is that the path you provide for the file must be accessible to the master and worker nodes; if the nodes are unable to access the file, you face such issues. This comes down to networking: if you can make your local file accessible to the master and worker nodes, you won't face this issue.

This seems like a bug in spark-shell when reading a local file, but there is a workaround: when running the spark-submit command, just specify the following on the command line.

--conf "spark.authenticate=false"

See SPARK-23476 for reference.
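
For illustration, the flag is passed on the spark-submit command line roughly as below; this is only a sketch, and the class name and jar are placeholders for your own application.

spark-submit \
      --master yarn \
      --deploy-mode client \
      --conf "spark.authenticate=false" \
      --class com.example.MyCsvJob \
      my-csv-job.jar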

