This issue came up while I was running some tests on the Spark SQL external datasource API.
I build the same DataFrame in two ways and compare the speed of the collect action. I find that when the number of columns is large, the DataFrame built from the external datasource lags behind. I would like to know whether this is a limitation of Spark SQL's external datasource API. :-)
To present the question more clearly, I wrote a small piece of code:
https://github.com/sunheehnus/spark-sql-test/
My benchmark code for the External Datasource API implements a fake external datasource (backed by an RDD[(String, Array[Int])]), and gets the DataFrame with
val cmpdf = sqlContext.load("com.redislabs.test.dataframeRP", Map[String, String]())
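For reference, here is a minimal sketch of what such a fake datasource might look like with the Spark 1.x Data Sources API (RelationProvider + TableScan). The class name FakeRelation, the schema layout, and the hard-coded sizes are my illustrative assumptions, not necessarily what the linked repo does:

package com.redislabs.test.dataframeRP

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types._

// Spark resolves "com.redislabs.test.dataframeRP" to a class named DefaultSource in that package.
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new FakeRelation(sqlContext)  // FakeRelation is a hypothetical name
}

class FakeRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  private val colnum = 2048  // assumed column count; the real value lives in the benchmark

  // One string column plus colnum + 1 integer columns, mirroring the plain-RDD version below.
  override def schema: StructType = StructType(
    StructField("instant", StringType) +:
      (0 to colnum).map(i => StructField(i.toString, IntegerType)))

  // Generate the same rows as the plain-RDD version of the benchmark.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to 2048, 3).map { x =>
      Row.fromSeq(x.toString +: (x to x + colnum))
    }
}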
Then I build the same data as a plain RDD and get a DataFrame with
import sqlContext.implicits._  // needed for .toDF() on the RDD

val rdd = sqlContext.sparkContext.parallelize(1 to 2048, 3)
val mappedrdd = rdd.map(x => (x.toString, (x to x + colnum).toArray))
val df = mappedrdd.toDF()
// Expand the array column into colnum + 1 individual columns named "0", "1", ...
val dataColExpr = (0 to colnum).map(_.toString).zipWithIndex.map { case (key, i) => s"_2[$i] AS `$key`" }
val allColsExpr = "_1 AS instant" +: dataColExpr
val df1 = df.selectExpr(allColsExpr: _*)
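The comparison itself is just timing a collect() on each DataFrame, roughly like the sketch below. The timeMs helper and the order in which the two collects are timed are my own illustration, not code taken from the repo:

// Hypothetical timing helper: returns elapsed wall-clock time in milliseconds.
def timeMs(block: => Unit): Long = {
  val start = System.currentTimeMillis()
  block
  System.currentTimeMillis() - start
}

println(timeMs { df1.collect() })    // DataFrame built from the plain RDD
println(timeMs { cmpdf.collect() })  // DataFrame built through the external datasource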
When I run the test code (on my laptop), I get the following result:
9905
21427
But when I reduce the column count (to 512), I get:
4323
2221
So it looks like the External Datasource API benefits when the column count in the schema is small, but as the column count grows it eventually lags behind... I am wondering whether this is a limitation of Spark SQL's External Datasource API, or whether I am using the API in the wrong way. Thanks very much. :-)