Spark error "Output directory file already exists"

Question

I ran a simple sample (Spark, Windows 7) and got an unexpected FileAlreadyExistsException. I cannot find the folder or file named in the message anywhere on my computer.

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/PluralsightData/ReadMeWordCountViaApp already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1191)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1168)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1168)

package main

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._

object WordCounter {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("Word Counter")
        val sc = new SparkContext(conf)
        //val textFile = sc.textFile("file:///Spark/README.md")
        val textFile = sc.textFile("file:///README.md")
        // Split each line into words on single spaces.
        val tokenizedFileData = textFile.flatMap(line=>line.split(" "))
        // Pair every word with an initial count of 1, then sum the counts per word.
        val countPrep = tokenizedFileData.map(word=>(word, 1))
        val counts = countPrep.reduceByKey((accumValue, newValue)=>accumValue + newValue)
        // Sort by count, descending.
        val sortedCounts = counts.sortBy(kvPair=>kvPair._2, false)
        // Fails with FileAlreadyExistsException if this directory already exists.
        sortedCounts.saveAsTextFile("file:///PluralsightData/ReadMeWordCountViaApp")
    }
}

The source code of the sample is available at https://github.com/constructor-igor/TechSugar/tree/master/ScalaSamples/WordCounterSample

Well... the message says it plainly: the output directory already exists, so saveAsTextFile will not write to it. Most big-data frameworks avoid any chance of overwriting existing data, so they do not allow output into an existing directory. Just pick some other directory for your output.
– sarveshseri, Commented Feb 6, 2017 at 13:50
How can I find the directory where saveAsTextFile stores the result, so that I can open it?
– constructor, Commented Feb 6, 2017 at 16:13
What about using an absolute path like "file:///C:/temp/WordCount"? Or look at stackoverflow.com/questions/38669206/… about some possible glitches across Spark versions.
– Samson Scharfrichter, Commented Feb 6, 2017 at 22:28

constructor · Accepted Answer · 2017-02-07 11:06:20Z

1

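Based on the comments:

Spark prefers to avoid overwriting any existing data, so it refuses to save into a directory that already exists.
Using an absolute path (including the drive letter) for the target makes the result easy to find on the local disk.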

sortedCounts.saveAsTextFile("file:///C:/temp/ReadMeWordCountViaApp")
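
If the job has to be rerun against the same path, one option is to remove the old output first with Hadoop's FileSystem API. The following is only a minimal sketch, not something from the original sample: the object name OutputDirCleanup and the helper deleteIfExists are hypothetical names chosen for illustration, and the path is the one used above. It also prints the fully qualified location, which may help when looking for the output on disk.

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object OutputDirCleanup {
    // Hypothetical helper (not part of Spark): deletes outputUri recursively
    // if it exists. Returns true only when something was actually deleted.
    def deleteIfExists(sc: SparkContext, outputUri: String): Boolean = {
        val fs = FileSystem.get(new URI(outputUri), sc.hadoopConfiguration)
        val path = new Path(outputUri)
        // Show where the file system resolves the path to.
        println(s"Output location: ${fs.makeQualified(path)}")
        if (fs.exists(path)) fs.delete(path, true) else false
    }

    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("Output Dir Cleanup")
        val sc = new SparkContext(conf)
        // Assumed path, taken from the answer above.
        val output = "file:///C:/temp/ReadMeWordCountViaApp"
        deleteIfExists(sc, output)
        // Stand-in data; in the sample this would be sortedCounts.
        sc.parallelize(Seq(("spark", 2), ("word", 1))).saveAsTextFile(output)
        sc.stop()
    }
}
```

Note that this permanently deletes the previous results, which is exactly what Spark's check is meant to prevent, so it is only appropriate when the old output is disposable.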
