
We are using the Spark CSV reader to read a CSV file into a DataFrame, and we are running the job in yarn-client mode; it works fine in local mode.

We are submitting the Spark job from an edge node.

But when we place the file on a local file path instead of HDFS, we get a FileNotFoundException.

Code:

sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true")
      .load("file:/filepath/file.csv")

We also tried file:///, but we still get the same error.

Error log:

2016-12-24 16:05:40,044 WARN  [task-result-getter-0] scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hklvadcnc06.hk.standardchartered.com): java.io.FileNotFoundException: File file:/shared/sample1.csv does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:241)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
  • Does that file exist at that location? Commented Dec 24, 2016 at 12:38
  • @mrsrinivas: yes, it's available; that's why the job works fine when I run it on the YARN cluster in local mode. It only fails in yarn-client mode. Commented Dec 24, 2016 at 12:49
  • In a normal case it should work as you have tried. However, if the intention is to make it work, then try SparkFiles; in your case something like this: import org.apache.spark.SparkFiles SparkContext.addFile("file:/filepath/file.csv") println(SparkFiles.getRootDirectory()) println(SparkFiles.get("file.csv")) sqlContext.read.format("com.databricks.spark.csv") .option("header", "true").option("inferSchema", "true") .load(SparkFiles.get("file.csv")) (a readable version of this snippet follows after these comments). Commented Dec 24, 2016 at 19:56
  • Also, please post all the versions and the spark-submit command as part of your question. Commented Dec 24, 2016 at 20:02
  • @Ram Ghadiyaram: thanks, I will try SparkFiles tomorrow and let you know. Commented Dec 25, 2016 at 13:50
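
For reference, here is the SparkFiles suggestion from the comment above in readable form. This is only a rough sketch: it assumes the same sqlContext as in the question and an existing SparkContext instance named sc (addFile is an instance method, so sc.addFile is used rather than the static-looking SparkContext.addFile from the comment).

import org.apache.spark.SparkFiles

// Ship the local file from the driver (edge node) to every executor.
sc.addFile("file:/filepath/file.csv")

// Directory where shipped files land, and the local path of the shipped copy.
println(SparkFiles.getRootDirectory())
println(SparkFiles.get("file.csv"))

// Read the shipped copy instead of the original edge-node path.
sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true")
      .load(SparkFiles.get("file.csv"))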

2 Answers


Yes, this will work fine in local mode, but it won't work when submitted from the edge node, because the file on the edge node's local filesystem is not accessible to the executors running on the cluster nodes. HDFS makes the file accessible to all nodes through its URL.
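
As an illustration, one common fix is to copy the file into HDFS first (for example with hdfs dfs -put /shared/sample1.csv /user/spark/sample1.csv; this HDFS path is only a placeholder) and then read it from the HDFS URL instead of a local path:

// Minimal sketch, assuming the file was already copied into HDFS at the path below.
sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true")
      .load("hdfs:///user/spark/sample1.csv")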


2 Comments

Does it mean we cannot read any files from a Linux file path, and only an HDFS location should be used to read files?
To be honest, I never tried this. What actually happens is that the path you provide for the file must be accessible to the master and worker nodes; if the nodes are unable to access the file, you face such issues. This comes down to networking: if you can make your local file accessible to the master and worker nodes, you won't face this issue.

This seems like a bug in spark-shell when reading a local file, but there is a workaround: when running the spark-submit command, just specify the following on the command line.

--conf "spark.authenticate=false"

See SPARK-23476 for reference.
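
For illustration, the flag is passed on the spark-submit command line roughly as below; this is only a sketch, and the class name and jar are placeholders for your own application.

spark-submit \
      --master yarn \
      --deploy-mode client \
      --conf "spark.authenticate=false" \
      --class com.example.MyCsvJob \
      my-csv-job.jar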

