
We are testing a driver that may provide excellent parallelism when optimized. The trick is that it does not parallelize (in accessing DB2) inside Spark partitions, so the requirement is that we tell it how many parallel threads we want and submit a query for each thread. I had hoped to do this in a loop with an array of DataFrame objects, but I could not figure out how to write the Scala for an array of DataFrame objects. For a brute-force test I did:

    val DF1 = sqlContext.read.format("jdbc"). ...yada yada
    val DF2 = sqlContext.read.format("jdbc"). ...yada yada
    val DF3 = sqlContext.read.format("jdbc"). ...yada yada
    val DF4 = sqlContext.read.format("jdbc"). ...yada yada

    val unionDF = DF1.unionAll(DF2).unionAll(DF3).unionAll(DF4)

And this worked great for parallelizing into 4 partitions. I'd prefer to do it in a loop, but then it would appear I'd need something like:

    var myDF = new Array[DataFrame](parallelBreakdown)

but DataFrame is not recognized as a type. Any thoughts on doing this without the brute-force method? Thanks,
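Roughly what I had in mind (a sketch only; jdbcUrl and the querySql helper are placeholders for the connection options and per-thread queries elided above):

    import org.apache.spark.sql.DataFrame

    // Placeholder helper: builds the SQL pushed down for thread i (not my real query)
    def querySql(i: Int): String =
      s"SELECT * FROM MY_TABLE WHERE MOD(ID, $parallelBreakdown) = $i"

    // One DataFrame per thread, each backed by its own query
    val dfs: Array[DataFrame] = (0 until parallelBreakdown).toArray.map { i =>
      sqlContext.read.format("jdbc")
        .option("url", jdbcUrl)                        // placeholder connection string
        .option("dbtable", s"(${querySql(i)}) AS q$i")
        .load()
    }

    // Collapse the array into a single DataFrame
    val unionDF: DataFrame = dfs.reduce(_.unionAll(_))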

  • it does not parallelize (in accessing DB2) inside Spark partitions - why not simply increase the number of partitions? What you want is just a matter of standard operations on Scala collections, but this looks like an XY problem. Commented Mar 3, 2016 at 22:17
  • A loop...are you talking about something like Seq(DF1,DF2,DF3,DF4).reduce( _.unionAll(_)) ? Commented Mar 3, 2016 at 23:25
  • First of all, thanks for responding. The reason I need to do them separately is that I'm testing a new driver that works with its own form of parallelizing. But unlike Spark's standard JDBC partitioning, where I can specify the partition column, lower bound, and so on (sketched below), I need to identify the number of partitions and submit a separate query for each partition. My thought was that it would be nice to define an array of DataFrame objects and then just drive a loop through the range of 1 to NumPartitions, then unionAll each element of the array. It sort of worked when I just wrote to the same DF object in the loop, but it did not parallelize. Thanks again Commented Mar 5, 2016 at 20:26
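For context, the standard partitioned JDBC read that the comments contrast this with looks roughly like the following; the connection string, table, column, and bounds are placeholders:

    // Spark's built-in JDBC partitioning: numPartitions parallel range queries
    // are issued over partitionColumn between lowerBound and upperBound.
    val partitionedDF = sqlContext.read.format("jdbc")
      .option("url", jdbcUrl)              // placeholder connection string
      .option("dbtable", "MY_TABLE")       // placeholder table
      .option("partitionColumn", "ID")     // must be a numeric column
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "4")
      .load()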

1 Answer


DataFrame is indeed a type

import org.apache.spark.sql.DataFrame

I was able to define a function

    def querier(dim_vals: Array[String]): Array[DataFrame] = {
      dim_vals.map(dim_val =>
        sqlContext.sql(MY_QUERY))   // MY_QUERY is assumed to be built from dim_val
    }

which returns an Array[DataFrame], and I was able to use Robert Congiu's answer to reduce it to a single DataFrame and call .show() on it.
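A sketch of that last step, with dims as a placeholder list of dimension values (not from the original answer):

    val dims = Array("A", "B", "C", "D")            // placeholder dimension values
    val dfs: Array[DataFrame] = querier(dims)       // one DataFrame per value
    val combined = dfs.reduce(_.unionAll(_))        // single DataFrame over all queries
    combined.show()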
