
We are testing a driver that may provide excellent parallelism when optimized. The trick is that it does not parallelize (in accessing DB2) inside Spark partitions, so the requirement is that we tell it how many parallel threads we want and submit a query for each thread. I had hoped to do this in a loop with an array of DataFrame objects, but I could not figure out how to write the Scala for an array of DataFrame objects. For a brute-force test I did:

    val DF1 = sqlContext.read.format("jdbc"). ...yada yada
    val DF2 = sqlContext.read.format("jdbc"). ...yada yada
    val DF3 = sqlContext.read.format("jdbc"). ...yada yada
    val DF4 = sqlContext.read.format("jdbc"). ...yada yada

    val unionDF = DF1.unionAll(DF2).unionAll(DF3).unionAll(DF4)

And this worked great for parallelizing into 4 partitions. I'd prefer to do it in a loop, but then it would appear I'd need something like:

    var myDF = new Array[DataFrame](parallelBreakdown)

but DataFrame is not recognized as a type. Any thoughts on doing this without the brute-force method? Thanks,
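Roughly what I had in mind (a sketch only; jdbcUrl and the querySql helper are placeholders for the connection options and per-thread queries elided above):

    import org.apache.spark.sql.DataFrame

    // Placeholder helper: builds the SQL pushed down for thread i (not my real query)
    def querySql(i: Int): String =
      s"SELECT * FROM MY_TABLE WHERE MOD(ID, $parallelBreakdown) = $i"

    // One DataFrame per thread, each backed by its own query
    val dfs: Array[DataFrame] = (0 until parallelBreakdown).toArray.map { i =>
      sqlContext.read.format("jdbc")
        .option("url", jdbcUrl)                        // placeholder connection string
        .option("dbtable", s"(${querySql(i)}) AS q$i")
        .load()
    }

    // Collapse the array into a single DataFrame
    val unionDF: DataFrame = dfs.reduce(_.unionAll(_))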

  • it does not parallelize (in accessing DB2) inside Spark partitions - why not simply increase the number of partitions? What you want is just a matter of standard operations on Scala collections, but this looks like an XY problem. Commented Mar 3, 2016 at 22:17
  • A loop...are you talking about something like Seq(DF1,DF2,DF3,DF4).reduce( _.unionAll(_)) ? Commented Mar 3, 2016 at 23:25
  • First of all, thanks for responding. The reason I need to do them separately is that I'm testing a new driver that works with its own form of parallelizing. But unlike Spark's standard JDBC partitioning, where I can specify the partition column, lower bound, and so on (sketched below), I need to identify the number of partitions and submit a separate query for each partition. My thought was that it would be nice to define an array of DataFrame objects and then just drive a loop through the range of 1 to NumPartitions, then unionAll each element of the array. It sort of worked when I just wrote to the same DF object in the loop, but it did not parallelize. Thanks again Commented Mar 5, 2016 at 20:26
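For context, the standard partitioned JDBC read that the comments contrast this with looks roughly like the following; the connection string, table, column, and bounds are placeholders:

    // Spark's built-in JDBC partitioning: numPartitions parallel range queries
    // are issued over partitionColumn between lowerBound and upperBound.
    val partitionedDF = sqlContext.read.format("jdbc")
      .option("url", jdbcUrl)              // placeholder connection string
      .option("dbtable", "MY_TABLE")       // placeholder table
      .option("partitionColumn", "ID")     // must be a numeric column
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "4")
      .load()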

1 Answer


DataFrame is indeed a type

import org.apache.spark.sql.DataFrame

I was able to define a function

    def querier(dim_vals: Array[String]): Array[DataFrame] = {
      dim_vals.map(dim_val =>
        sqlContext.sql(MY_QUERY))   // MY_QUERY is assumed to be built from dim_val
    }

which returns an Array[DataFrame], and I was able to use Robert Congiu's answer to reduce it to a single DataFrame and call .show() on it.
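A sketch of that last step, with dims as a placeholder list of dimension values (not from the original answer):

    val dims = Array("A", "B", "C", "D")            // placeholder dimension values
    val dfs: Array[DataFrame] = querier(dims)       // one DataFrame per value
    val combined = dfs.reduce(_.unionAll(_))        // single DataFrame over all queries
    combined.show()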
