Apache Spark Throws java.lang.IllegalStateException: unread block data

Question

What we are doing is:

1. Installing Spark 0.9.1 according to the documentation on the website, along with the CDH4 (and, on another cluster, CDH5) distributions of Hadoop/HDFS.
2. Building a fat jar containing a Spark app with sbt, then trying to run it on the cluster.

I've also included code snippets, and sbt deps at the bottom.

When I Googled this, there seem to be two somewhat vague responses: (a) mismatched Spark versions on nodes/user code, and (b) the need to add more jars to the SparkConf.

Now, I know that (b) is not the problem, having successfully run the same code on other clusters while including only one jar (it's a fat jar).

But I have no idea how to check for (a). Spark doesn't appear to have any version checks; it would be nice if it checked versions and threw a "mismatching version exception: you have user code using version X and node Y has version Z".
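One possible way to probe for (a) without relying on any Spark API is to ask the JVM which jar a class was loaded from and what version that jar's manifest declares. The following is only a minimal sketch using plain JVM reflection: `ClasspathProbe` and `describe` are hypothetical names, and whether a jar declares an `Implementation-Version` at all depends on how it was built, so a missing version proves nothing by itself.

```scala
object ClasspathProbe {

  /** For a class name, returns None if the class is not on the classpath,
    * otherwise (where it was loaded from, manifest Implementation-Version),
    * each of which may be absent.
    */
  def describe(className: String): Option[(Option[String], Option[String])] =
    try {
      // Load without running static initializers; we only want metadata.
      val cls = Class.forName(className, false, getClass.getClassLoader)

      // JDK classes typically have no code source, hence the Option wrapping.
      val location = for {
        domain <- Option(cls.getProtectionDomain)
        source <- Option(domain.getCodeSource)
        url    <- Option(source.getLocation)
      } yield url.toString

      val version = for {
        pkg <- Option(cls.getPackage)
        v   <- Option(pkg.getImplementationVersion)
      } yield v

      Some((location, version))
    } catch {
      case _: ClassNotFoundException => None
    }

  def main(args: Array[String]): Unit = {
    // Default class name is an assumption: the class you care about may differ.
    val names = if (args.nonEmpty) args.toSeq else Seq("org.apache.spark.SparkContext")
    names.foreach { name =>
      describe(name) match {
        case Some((loc, ver)) =>
          println(name + " loaded from " + loc.getOrElse("<unknown>") +
            ", version " + ver.getOrElse("<not declared>"))
        case None =>
          println(name + " is not on the classpath")
      }
    }
  }
}
```

Running this with the application's classpath and again with the classpath a worker node uses, then comparing the reported jar locations and versions, may reveal a mismatch. It is a diagnostic aid, not a substitute for a real version check in Spark.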

I would be very grateful for advice on this. I've submitted a bug report, because there must be something wrong with the Spark documentation: I've seen two independent sysadmins hit the exact same problem with different versions of CDH on different clusters. https://issues.apache.org/jira/browse/SPARK-1867

The exception:

Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59]

My code snippet:

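import org.apache.spark.{SparkConf, SparkContext}

// clusterMaster, appName, sparkHome and someHdfsPath are defined elsewhere in the application.
val conf = new SparkConf()
  .setMaster(clusterMaster)
  .setAppName(appName)
  .setSparkHome(sparkHome)
  .setJars(SparkContext.jarOfClass(this.getClass)) // ship the jar that contains this class (the fat jar)

// Count the lines of a text file on HDFS.
println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())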

My SBT dependencies:

// relevant
"org.apache.spark" % "spark-core_2.10" % "0.9.1",
"org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",

// standard, probably unrelated
"com.github.seratch" %% "awscala" % "[0.2,)",
"org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
"org.specs2" %% "specs2" % "1.14" % "test",
"org.scala-lang" % "scala-reflect" % "2.10.3",
"org.scalaz" %% "scalaz-core" % "7.0.5",
"net.minidev" % "json-smart" % "1.2"

Dici · Accepted Answer · 2016-06-27 00:32:08Z

Changing

"org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",

to

"org.apache.hadoop" % "hadoop-common" % "2.3.0-cdh5.0.0"

in my application code seemed to fix this. I'm not entirely sure why. We have hadoop-yarn on the cluster, so maybe the "mr1" variant broke things.
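For context, here is a sketch of how that change might sit in a build.sbt, showing only the two dependencies the question marks as relevant. The resolver line is an assumption: the question does not say where the CDH artifacts were resolved from, and Cloudera's public repository is just one likely source.

```scala
// build.sbt (sketch): only the dependencies the question marks as relevant.
libraryDependencies ++= Seq(
  "org.apache.spark"  % "spark-core_2.10" % "0.9.1",
  // was: "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0"
  "org.apache.hadoop" % "hadoop-common"   % "2.3.0-cdh5.0.0"
)

// Assumption: CDH artifacts come from Cloudera's repository.
resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
```

After changing the dependency, the fat jar has to be rebuilt so that the old Hadoop classes are no longer bundled.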

answered May 23, 2014 at 15:55 by samthebest (edited Jun 27, 2016 at 0:32 by Dici)

reducer · Accepted Answer · 2014-10-30 17:40:00Z

I recently ran into this issue with CDH 5.2 + Spark 1.1.0.

It turns out the problem was in my spark-submit command, where I was using

--master yarn

instead of the new

--master yarn-cluster

answered Oct 30, 2014 at 17:40 by reducer
