[We are testing a driver that may provide excellent parallelism when optimized. The trick is that it does not parallelize DB2 access inside Spark partitions, so we have to tell it how many parallel threads we want and submit one query per thread. I had hoped to do this in a loop over an array of DataFrame objects, but I could not figure out how to declare an array of DataFrames in Scala. As a brute-force test I did:
val DF1 = sqlContext.read.format("jdbc"). ...yada yada
val DF2 = sqlContext.read.format("jdbc"). ...yada yada
val DF3 = sqlContext.read.format("jdbc"). ...yada yada
val DF4 = sqlContext.read.format("jdbc"). ...yada yada
val unionDF = DF1.unionAll(DF2).unionAll(DF3).unionAll(DF4)
And this worked great for parallelizing into 4 partitions. I'd prefer to do it in a loop, but then it would appear I'd need something like:
var myDF = new Array[DataFrame](parallelBreakdown)
...and the compiler complains that DataFrame is not a type. Any thoughts on doing this without the brute-force method? Thanks,
Seq(DF1, DF2, DF3, DF4).reduce(_ unionAll _)?
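To expand on that: the "DataFrame is not a type" error usually just means the import is missing; DataFrame lives in org.apache.spark.sql. With the import in place you can build the per-thread queries in a loop and fold the union, rather than writing each DF by hand. A minimal sketch (the JDBC URL, table name, and MOD-based split predicate below are hypothetical placeholders, not taken from the original post):

```scala
import org.apache.spark.sql.DataFrame

val parallelBreakdown = 4

// One predicate per thread, e.g. splitting rows on a key column.
// The split condition is a hypothetical example.
val predicates: Seq[String] = (0 until parallelBreakdown).map { i =>
  s"MOD(ID, $parallelBreakdown) = $i"
}

// Build one DataFrame per predicate; the import above is what makes
// Seq[DataFrame] / Array[DataFrame] compile.
val dfs: Seq[DataFrame] = predicates.map { pred =>
  sqlContext.read.format("jdbc")
    .option("url", "jdbc:db2://dbhost:50000/SAMPLE")             // hypothetical URL
    .option("dbtable", s"(SELECT * FROM MYTABLE WHERE $pred) t") // hypothetical table
    .load()
}

// Same result as the brute-force version, in one expression.
val unionDF: DataFrame = dfs.reduce(_ unionAll _)
```

Since unionAll only concatenates partitions (it does not shuffle), the result keeps one partition per query, which is exactly the per-thread breakdown you were after.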