1

I have a dataframe which has one of the column called "Query" having the select statement present. Want to execute this query and create a new column having actual results from the TempView.

+--------------+-----------+-----+----------------------------------------+
|DIFFCOLUMNNAME|DATATYPE   |ISSUE|QUERY                                   |
+--------------+-----------+-----+----------------------------------------+
|Firstname     |StringType |YES  |Select Firstname from TempView  limit 1 |
|LastName      |StringType |NO   |Select LastName from TempView  limit 1  |
|Designation   |StringType |YES  |Select Designation from TempView limit 1|
|Salary        |IntegerType|YES  |Select Salary from TempView    limit 1  |
+--------------+-----------+-----+----------------------------------------+

Getting error as Type mismatch, Required String found column. Do I need to use UDF here. But not sure how to write and use. Please suggest

DF.withColumn("QueryResult", spark.sql(col("QUERY")))

TempView is Temporary View which I have created having all the required columns. Expected final Dataframe will be something like this with the new column added QUERYRESULT.

+--------------+-----------+-----+----------------------------------------+------------+
|DIFFCOLUMNNAME|DATATYPE   |ISSUE|QUERY                                   | QUERY RESULT
+--------------+-----------+-----+----------------------------------------+------------+
|Firstname     |StringType |YES  |Select Firstname from TempView  limit 1 | Bunny      |
|LastName      |StringType |NO   |Select LastName from TempView  limit 1  | Gummy      |
|Designation   |StringType |YES  |Select Designation from TempView limit 1| Developer  |
|Salary        |IntegerType|YES  |Select Salary from TempView    limit 1  | 100        |
+--------------+-----------+-----+----------------------------------------+------------+

3
  • show some code for others to look at. unusual construct Commented Sep 14, 2021 at 12:01
  • 1
    I added the code, The one line having withColumn clause. Getting error since expected is String and getting column Commented Sep 14, 2021 at 12:53
  • The short answer is "no, you cannot do that". What you can do is a work-around illustrated in pasha701s answer: collect the queries so that they are available in the driver and then execute the queries one by one. But why would you store the queries in a Spark dataframe at all when the data needs to be present in the driver process anyway? It would probably be easier to use a list of case classes instead of a Spark dataframe to hold the queries. Commented Sep 14, 2021 at 15:24

2 Answers 2

3

If number of queries is limited, you can collect them, execute each, and join with original queries dataframe (Kieran was faster with his answer, but mine answer has example):

val queriesDF = Seq(
  ("Firstname", "StringType", "YES", "Select Firstname from TempView  limit 1 "),
  ("LastName", "StringType", "NO", "Select LastName from TempView  limit 1 "),
  ("Designation", "StringType", "YES", "Select Designation from TempView limit 1"),
  ("Salary", "IntegerType", "YES", "Select Salary from TempView limit 1 ")
).toDF(
  "DIFFCOLUMNNAME", "DATATYPE", "ISSUE", "QUERY"
)
val data = Seq(
  ("Bunny", "Gummy", "Developer", 100)
)
  .toDF("Firstname", "LastName", "Designation", "Salary")

data.createOrReplaceTempView("TempView")

// get all queries and evaluate results
val queries = queriesDF.select("QUERY").distinct().as(Encoders.STRING).collect().toSeq
val queryResults = queries.map(q => (q, spark.sql(q).as(Encoders.STRING).first()))
val queryResultsDF = queryResults.toDF("QUERY", "QUERY RESULT")

// Join original queries and results
queriesDF.alias("queriesDF")
  .join(queryResultsDF, Seq("QUERY"))
  .select("queriesDF.*", "QUERY RESULT")

Output:

+----------------------------------------+--------------+-----------+-----+------------+
|QUERY                                   |DIFFCOLUMNNAME|DATATYPE   |ISSUE|QUERY RESULT|
+----------------------------------------+--------------+-----------+-----+------------+
|Select Firstname from TempView  limit 1 |Firstname     |StringType |YES  |Bunny       |
|Select LastName from TempView  limit 1  |LastName      |StringType |NO   |Gummy       |
|Select Designation from TempView limit 1|Designation   |StringType |YES  |Developer   |
|Select Salary from TempView limit 1     |Salary        |IntegerType|YES  |100         |
+----------------------------------------+--------------+-----------+-----+------------+
Sign up to request clarification or add additional context in comments.

Comments

2

Assuming you don't have that many 'query rows', just collect the results to driver using df.collect() and then map over queries using plain Scala.

6 Comments

That's not an answer to this actual question. It's an alternative - possibly.
Sure, its not the answer to the current wording, but it answers the intent of their question...
They tend to be finnicky here, so so am I for once.
Then edit the question and be more specific. Given the question they could need a little more info.
The first line of their question is "I have a dataframe which has one of the column called "Query" having the select statement present. Want to execute this query and create a new column having actual results from the TempView."
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.