I'm hitting a very strange problem when trying to load a JDBC DataFrame into Spark SQL.

I've tried several Spark setups: YARN, a standalone cluster, and pseudo-distributed mode on my laptop. It's reproducible on both Spark 1.3.0 and 1.3.1. The problem occurs both in spark-shell and when executing the code with spark-submit. I've tried MySQL and MS SQL JDBC drivers, without success.

Consider the following sample:

val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/test"

val t1 = {
  sqlContext.load("jdbc", Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> "t1",
    "partitionColumn" -> "id",
    "lowerBound" -> "0",
    "upperBound" -> "100",
    "numPartitions" -> "50"
  ))
}

So far so good; the schema gets resolved properly:

t1: org.apache.spark.sql.DataFrame = [id: int, name: string]

But when I evaluate the DataFrame:

t1.take(1)

the following exception occurs:

15/04/29 01:56:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.1.42): java.sql.SQLException: No suitable driver found for jdbc:mysql://<hostname>:3306/test
    at java.sql.DriverManager.getConnection(DriverManager.java:689)
    at java.sql.DriverManager.getConnection(DriverManager.java:270)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:158)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:150)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:317)
    at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

When I try to open a JDBC connection on an executor:

import java.sql.DriverManager

sc.parallelize(0 until 2, 2).map { i =>
  Class.forName(driver)
  val conn = DriverManager.getConnection(url)
  conn.close()
  i
}.collect()

it works perfectly:

res1: Array[Int] = Array(0, 1)

When I run the same code in local mode, it works perfectly too:

scala> t1.take(1)
...
res0: Array[org.apache.spark.sql.Row] = Array([1,one])

I'm using Spark pre-built with Hadoop 2.4 support.

The easiest way to reproduce the problem is to start Spark in pseudo-distributed mode with the start-all.sh script and run the following command:

/path/to/spark-shell --master spark://<hostname>:7077 --jars /path/to/mysql-connector-java-5.1.35.jar --driver-class-path /path/to/mysql-connector-java-5.1.35.jar

Is there a way to work around this? It looks like a severe problem, so it's strange that googling doesn't turn anything up.

4 Answers

Apparently this issue has been recently reported:

https://issues.apache.org/jira/browse/SPARK-6913

The problem is that java.sql.DriverManager doesn't see drivers loaded by ClassLoaders other than the bootstrap ClassLoader.

As a temporary workaround, it's possible to add the required drivers to the boot classpath of the executors.
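
One way to do that on a standalone cluster is through the executors' JVM options; the exact flag below is just a sketch (I haven't tested it everywhere), and it assumes the driver jar sits at the same path on every worker node:

# sketch: append the driver jar to the executors' bootstrap classpath
/path/to/spark-shell --master spark://<hostname>:7077 \
  --jars /path/to/mysql-connector-java-5.1.35.jar \
  --driver-class-path /path/to/mysql-connector-java-5.1.35.jar \
  --conf "spark.executor.extraJavaOptions=-Xbootclasspath/a:/path/to/mysql-connector-java-5.1.35.jar"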

UPDATE: This pull request fixes the problem: https://github.com/apache/spark/pull/5782

UPDATE 2: The fix has been merged into Spark 1.4.
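
For reference, on 1.4+ the same load can also be expressed through the DataFrameReader API; this is just a sketch of the equivalent call with the options from the question (the older sqlContext.load(source, options) form is deprecated there):

// Spark 1.4+ sketch: same JDBC options as in the question, via DataFrameReader
val t1 = sqlContext.read
  .format("jdbc")
  .options(Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> "t1",
    "partitionColumn" -> "id",
    "lowerBound" -> "0",
    "upperBound" -> "100",
    "numPartitions" -> "50"
  ))
  .load()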

2 Comments

Hey @Wildfire, how did you go about editing the boot classpath of the executors? I tried modifying the YARN container classpath to add the mysql-connector-java jar, but it did not fix the error. Thanks!
@JKnight I couldn't figure out how to add the JDBC driver to the boot classpath on YARN, so I'm just using a patched Spark build.

For writing data to MySQL

In Spark 1.4.0, you have to read from MySQL before writing to it, because Spark only loads the driver in the load function, not in the write function. You also have to put the jar on every worker node and set its path in the spark-defaults.conf file on each node. This issue has been fixed in Spark 1.5.0:

https://issues.apache.org/jira/browse/SPARK-10036
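
A minimal sketch of the 1.4.0 workaround, assuming the t1 table with (id, name) columns from the question; the credentials and the toy DataFrame are placeholders:

import java.util.Properties
import sqlContext.implicits._

val url = "jdbc:mysql://localhost:3306/test"
val props = new Properties()
props.setProperty("driver", "com.mysql.jdbc.Driver")
props.setProperty("user", "user")         // placeholder credentials
props.setProperty("password", "password")

// Reading first forces the JDBC driver to be loaded; in 1.4.0 the write path alone does not.
val existing = sqlContext.read.jdbc(url, "t1", props)

// Now the write goes through.
val df = Seq((2, "two")).toDF("id", "name")
df.write.mode("append").jdbc(url, "t1", props)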

We are stuck on Spark 1.3 (Cloudera 5.4), so I found this question and Wildfire's answer helpful, since it allowed me to stop banging my head against the wall.

Thought I would share how we got the driver into the boot classpath: we simply copied it into /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hive/lib on all the nodes.
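
For example, something along these lines (the host list is made up; any configuration-management tool would do the same job):

# hypothetical host list: copy the driver jar to the same path on every node
for host in node1 node2 node3; do
  scp mysql-connector-java-5.1.35.jar \
    "$host":/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/hive/lib/
done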

2 Comments

Hi, did you restart Cloudera after adding the driver to that folder? Thanks
@SantiArias yes I believe we did restart the cluster.

I am using Spark 1.6.1 with SQL Server and still faced the same issue. I had to add the library (sqljdbc-4.0.jar) to lib on the instance, and the line below to the conf/spark-defaults.conf file.

spark.driver.extraClassPath lib/sqljdbc-4.0.jar
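
If executors run on other nodes as well, they may need the jar on their classpath too; a possible spark-defaults.conf variant (the executor line is an addition on top of the answer and assumes the jar is at the same relative path on each node):

spark.driver.extraClassPath   lib/sqljdbc-4.0.jar
spark.executor.extraClassPath lib/sqljdbc-4.0.jar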
