import scala.collection.mutable.ArrayBuffer

spark.sql("set db=test_script")
spark.sql("set table=member_test")

val colDF = sql("show columns from ${table} from ${db}")
var tempArray = new ArrayBuffer[String]()
var temp = ""
colDF.foreach { row => row.toSeq.foreach { col => 
 temp = "count(case when "+ col+ " ='X' then 1 else NULL END) AS count"+ col
 tempArray += temp
}}

println(tempArray) // getting empty array
println(temp) // getting blank string

Hi, I am new to Scala programming. I am trying to loop through a DataFrame and append formatted strings to my ArrayBuffer. When I put the print statement inside the loop, everything seems fine, but when I access the ArrayBuffer outside the loop, it is empty. Is this related to the scope of the variable? I am using ArrayBuffer because I learned that List is immutable in Scala. Please suggest a better approach if there is one. Thanks in advance.


1 Answer


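The issue is that Spark is a distributed system. The function you pass to foreach is serialized and shipped to the executors, so each executor appends to its own copy of the buffer. Those copies are never sent back to the driver, which is why the buffer you print on the driver is still empty.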
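To make the copy semantics concrete, here is a minimal sketch that imitates the effect without Spark. It assumes that shipping a closure behaves roughly like a Java serialization round trip of the captured variables; `ClosureCopyDemo` and `shipCopy` are hypothetical names used only for this illustration, not part of any Spark API.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

object ClosureCopyDemo {
  // Hypothetical helper: round-trips a value through Java serialization,
  // which approximates what happens to a variable captured by a closure
  // that is sent to an executor.
  def shipCopy[T <: java.io.Serializable](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val driverBuffer = ArrayBuffer.empty[String]

    // The "executor" works on a deserialized copy, not on driverBuffer itself.
    val executorBuffer = shipCopy(driverBuffer)
    executorBuffer += "countcol_a"

    println(executorBuffer.size) // 1
    println(driverBuffer.size)   // 0: the append never reached the original
  }
}
```

If driver-side mutation is really needed, one option is to bring the rows back first, for example with colDF.collect().foreach { ... }, so that the loop body runs on the driver. That is only reasonable when the result is small, as it is for a list of column names.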

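Also note that colDF is a DataFrame, so each element passed to foreach is a Row. This means that when you do

row => row.toSeq

the result is a Seq[Any], which throws away the type information (this isn't good practice). A better way of doing this is to read the column names directly from the DataFrame: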

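import org.apache.spark.sql.DataFrame

val dataFrame: DataFrame = spark.sql("select * from test_script.member_test")
val columns: Array[String] = dataFrame.columns
// One aggregate expression per column, counting the rows where it equals 'X'
val sqlStatement = columns.map(c => s"count(case when $c = 'X' then 1 else NULL END) as count$c")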

However, an even better option is to skip SQL strings entirely and use Spark's DataFrame API:

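import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, count, lit, when}

val dataFrame: DataFrame = spark.sql("select * from test_script.member_test")
val columns: Array[String] = dataFrame.columns
// count ignores nulls, and when(...) without otherwise yields null for non-matching rows
val selectStatement: List[Column] = columns.map { c =>
    count(when(col(c) === "X", lit(1))).as(s"count$c")
}.toList
dataFrame.select(selectStatement: _*)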

2 Comments

Thanks so much, it works. Everything in this code is clear except the last line, selectStatement: _*.
The method signature of select is def select(cols: Column*). It expects to be called like this: dataFrame.select(col("a"), col("b"), col("c"), ...). ": _*" is special syntax that unpacks a sequence so that its elements are passed as if you had written them out as comma-separated arguments (without the "List").
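As a small illustration of the ": _*" syntax that does not need Spark, the sketch below uses a hypothetical varargs method, also called `select` but taking strings, as a stand-in for the real one. Both calls pass the same three arguments.

```scala
object VarargsDemo {
  // Hypothetical stand-in for a varargs method such as select(cols: Column*).
  def select(cols: String*): String = cols.mkString("SELECT ", ", ", "")

  def main(args: Array[String]): Unit = {
    val columns = List("a", "b", "c")

    // Arguments written out one by one.
    println(select("a", "b", "c")) // SELECT a, b, c

    // The same call, with the list expanded into varargs by ": _*".
    println(select(columns: _*))   // SELECT a, b, c
  }
}
```

Without ": _*", select(columns) would not compile here, because a List[String] is not a String.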
