Reading data from SQL Server using Spark SQL

Question

Is it possible to read data from Microsoft SQL Server (and Oracle, MySQL, etc.) into an RDD in a Spark application? Or do we need to create an in-memory set and parallelize that into an RDD?

Seems so... any reason why? If it can munge data from everything, why not the most common stores?
– ashic, Commented Oct 8, 2014 at 12:25
You will have to wait a few days to get this answered, as the apache-spark tag is very seldom used. Give the Apache folks a couple of days to answer your question.
– Deval Khandelwal, Commented Oct 8, 2014 at 12:36
You can certainly read the data into the driver and then parallelize that into an RDD. If you're looking for a more scalable solution, you probably want to look into using DBInputFormat with Spark's "Hadoop API" methods. I haven't done this before, but it seems like something good to look into.
– Nick Chammas, Commented Oct 8, 2014 at 18:22

kanielc · Accepted Answer · 2015-09-16 12:52:34Z

6

In Spark 1.4.0+ you can use sqlContext.read.jdbc.

That will give you a DataFrame instead of an RDD of Row objects.

The equivalent of the JdbcRDD solution you posted would be:

sqlContext.read.jdbc("jdbc:sqlserver://omnimirror;databaseName=moneycorp;integratedSecurity=true;", "TABLE_NAME", "id", 1, 100000, 1000, new java.util.Properties)
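The arguments "id", 1, 100000, 1000 are the partition column, lower bound, upper bound and number of partitions. The Spark SQL guide notes that the bounds only decide the partition stride and do not filter rows, so this call is not a strict equivalent of a bounded JdbcRDD query. The following is a rough, Spark-free sketch of the per-partition WHERE clauses such a call produces. It is an approximation based on my reading of Spark's behaviour, the exact clauses differ between versions, and the object and method names are hypothetical.

```scala
// Hypothetical helper, not part of Spark: approximates how a partitioned
// JDBC read turns (column, lowerBound, upperBound, numPartitions) into
// one WHERE clause per partition.
object JdbcPartitionPredicates {
  def predicates(column: String, lowerBound: Long, upperBound: Long,
                 numPartitions: Int): Seq[String] = {
    require(numPartitions >= 2, "this sketch only covers two or more partitions")
    // Integer division on each bound, as an assumption about Spark's arithmetic.
    val stride = upperBound / numPartitions - lowerBound / numPartitions
    (0 until numPartitions).map { i =>
      val from = lowerBound + i * stride
      val to = from + stride
      if (i == 0) s"$column < $to"                          // open below: rows under lowerBound land here
      else if (i == numPartitions - 1) s"$column >= $from"  // open above: rows over upperBound land here
      else s"$column >= $from AND $column < $to"
    }
  }
}
```

With the values from the call above, JdbcPartitionPredicates.predicates("id", 1, 100000, 1000) yields 1000 clauses with a stride of 100. Because the first and last clauses are open-ended, rows outside 1..100000 are still read.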

It should pick up the schema of the table, but if you'd like to force it, you can call the schema method after read: sqlContext.read.schema(...insert schema here...).jdbc(...rest of the things...)

Note that you won't get an RDD of SomeClass here (which is nicer in my view). Instead you'll get a DataFrame of the relevant fields.

More information can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases

edited Sep 16, 2015 at 12:52

answered Aug 28, 2015 at 13:23

ashic · Accepted Answer · 2014-10-23 11:30:59Z

Found a solution to this on the mailing list: JdbcRDD can be used to accomplish this. I needed to get the Microsoft SQL Server JDBC driver jar and add it to the lib for my project. I wanted to use integrated security, so I also needed to put sqljdbc_auth.dll (available in the same download) in a location that java.library.path can see. Then, the code looks like this:

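     import java.sql.{DriverManager, ResultSet}
     import org.apache.spark.rdd.JdbcRDD

     // sc is the application's SparkContext; X is the numeric column used for partitioning.
     val rdd = new JdbcRDD[SomeClass](sc,
       () => DriverManager.getConnection(
         "jdbc:sqlserver://omnimirror;databaseName=moneycorp;integratedSecurity=true;"),
       // JdbcRDD binds inclusive bounds to the two placeholders, hence <= rather than <.
       "SELECT * FROM TABLE_NAME WHERE ? <= X AND X <= ?",
       1, 100000, 1000, // lower bound, upper bound, number of partitions
       (r: ResultSet) => SomeClass(r.getString("Col1"),
         r.getString("Col2"), r.getString("Col3")))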

This gives an RDD of SomeClass. The fourth, fifth and sixth constructor arguments (1, 100000, 1000 here) are required: they are the lower bound, the upper bound and the number of partitions. In other words, the source data needs to be partitionable by a Long-valued column for this to work.
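To make the role of the bounds concrete, here is a small Spark-free sketch of how the key range is split into per-partition sub-ranges. It follows my understanding of JdbcRDD's partitioning arithmetic, so treat it as an illustration rather than the library's exact code. The object and method names are hypothetical.

```scala
// Hypothetical helper, not part of Spark: computes the inclusive
// (start, end) key range that each partition would bind to the two
// '?' placeholders of the query.
object JdbcRddRanges {
  def partitionRanges(lowerBound: Long, upperBound: Long,
                      numPartitions: Int): Seq[(Long, Long)] = {
    // BigInt avoids overflow when the key range is close to Long.MaxValue.
    val length = BigInt(upperBound) - BigInt(lowerBound) + 1
    (0 until numPartitions).map { i =>
      val start = BigInt(lowerBound) + (length * BigInt(i)) / BigInt(numPartitions)
      val end = BigInt(lowerBound) + (length * BigInt(i + 1)) / BigInt(numPartitions) - 1
      (start.toLong, end.toLong)
    }
  }
}
```

For the arguments used above, JdbcRddRanges.partitionRanges(1, 100000, 1000) gives 1000 ranges of 100 keys each, starting with (1, 100) and (101, 200). Since both ends of each range are inclusive, a query written with strict < comparisons would skip the rows that sit exactly on a range boundary.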
