
I'd like to use SQL-like syntax on a Spark DataFrame df. Let's say I need a calculation like

cal_col = 113.4*col1 +41.4*col2....

What I do at the moment is one of the following:

1/ Registering a temp view:

df.createOrReplaceTempView("df_view")
df = spark.sql("select *, 113.4*col1 +41.4*col2... AS cal_col from df_view")

Question: Is there a lot of overhead in registering a big df as a view? If so, at what point does it stop making sense? Say df has 250 columns and 15 million records.

2/ PySpark DataFrame syntax, which is a bit harder to read and requires rewriting the formula:

df = df.withColumn("cal_col", 113.4*F.col("col1") + 41.4*F.col("col2")+...)

The formula can be lengthy and becomes difficult to read.

Question: Is there a way to write this in SQL-like syntax without F.col?

Something along the lines of:

 df = df.select("*, (113.4*col1 +41.4*col2...) as cal_col")

1 Answer

You can use df.selectExpr(...) to write SQL-like expressions against your DataFrame:

df.selectExpr("*", "(113.4*col1 + 41.4*col2...) as cal_col")

Note that selectExpr takes each expression as a separate string argument, so "*" and the formula are passed separately rather than as one comma-joined string.

Also, rather than creating a view, a better way to do what you want is to call df.persist() before your logic to keep the DataFrame in memory (spilling to disk by default) and then run your selectExpr on it.
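A minimal runnable sketch of that combination, with toy numeric columns col1 and col2 standing in for the real 250-column DataFrame (names assumed from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real DataFrame; column names are assumed.
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["col1", "col2"])

# Keep the data in memory (spilling to disk by default) so later
# queries against df do not recompute it.
df = df.persist()

# Each selectExpr argument is parsed as one SQL expression;
# "*" keeps all existing columns.
df = df.selectExpr("*", "113.4*col1 + 41.4*col2 as cal_col")
df.show()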

Link: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.selectExpr
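Another option along the same lines, as a sketch: pyspark.sql.functions.expr parses a SQL expression string, so the formula can stay one readable string inside withColumn:

from pyspark.sql import functions as F

# Same calculation without F.col; expr() parses the SQL string.
df = df.withColumn("cal_col", F.expr("113.4*col1 + 41.4*col2"))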


5 Comments

What is the cost of createOrReplaceTempView? I thought it just creates a logical name to be used as a table, without materializing anything. persist, on the other hand, materializes the whole table in Spark memory.
You are correct, a temp view is just a lazily evaluated view and is not in memory. If you want the view's data in memory, the df used to create the view needs to be cached. I thought you wanted to work in memory with selectExpr, which is why I gave the persist suggestion.
No, I am not looking to cache anything in memory. Right now the only reason I use a temp view is to be able to write SQL-like queries, not to keep something in memory; hence the question about avoiding it and using native PySpark syntax, in case the temp view carries a cost. It turns out it does not cost much, as you said and as I thought. So a temp view is still a cheap option; it just requires an extra line of code.
Another related question: how rich is selectExpr? Does it support a more complicated query such as .selectExpr("*, (113.4*col1 + 41.4*col2...) as cal_col, case when col3 < 0 then 0 else 1 end as col3_indic where col_filter in (1,10) group by ... having ... order by ...")?
I mostly use expressions in PySpark for dealing with complex data types, higher-order functions, or nested data. Expressions such as the case when should work, but selectExpr parses each argument as a single column expression, so whole-query clauses like where, group by, having, or order by will not; those belong to the DataFrame API (filter, groupBy, orderBy) or to spark.sql (see the sketch below). I would still recommend the DataFrame API, as it is more readable and understandable for people without a database/SQL background.
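For reference, a sketch of where that boundary sits (column names col1, col2, col3, and col_filter taken from the comment above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0, -1, 1), (3.0, 4.0, 5, 10)],
    ["col1", "col2", "col3", "col_filter"],
)

# Column expressions, including case when, are fine in selectExpr:
out = df.selectExpr(
    "*",
    "113.4*col1 + 41.4*col2 as cal_col",
    "case when col3 < 0 then 0 else 1 end as col3_indic",
)

# Whole-query clauses are not; use DataFrame methods instead...
out = out.filter("col_filter in (1, 10)")

# ...or a temp view plus spark.sql for the full query:
df.createOrReplaceTempView("df_view")
out2 = spark.sql(
    "select *, case when col3 < 0 then 0 else 1 end as col3_indic "
    "from df_view where col_filter in (1, 10) order by col1"
)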
