
I'd like to use SQL-like syntax on a Spark DataFrame df. Let's say I need a calculation like

cal_col = 113.4*col1 +41.4*col2....

What I do at the moment is one of the following:

1/ Registering a temp view:

df.createOrReplaceTempView("df_view")
df = spark.sql("select *, 113.4*col1 +41.4*col2... AS cal_col from df_view")

Question: Is there a lot of overhead in registering a big df as a view? If so, at what point does it stop making sense? Say df has 250 columns and 15 million records.

2/ PySpark DataFrame syntax, which is a bit harder to read and requires rewriting the formula:

df = df.withColumn("cal_col", 113.4*F.col("col1") + 41.4*F.col("col2")+...)

The formula can be lengthy and becomes difficult to read.

Question: Is there a way to write this in SQL-like syntax without F.col?

Something along the lines of:

 df = df.select("*, (113.4*col1 +41.4*col2...) as cal_col")

1 Answer

You can use df.selectExpr(...) to write SQL-like expressions against your DataFrame:

df.selectExpr("*", "(113.4*col1 + 41.4*col2...) as cal_col")

Note that selectExpr takes each expression as a separate string argument, so "*" and the formula are passed separately rather than as one comma-joined string.

Also, rather than creating a view, a better way to do what you want is to call df.persist() before your logic to keep the DataFrame in memory (spilling to disk by default) and then run your selectExpr on it.
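A minimal runnable sketch of that combination, with toy numeric columns col1 and col2 standing in for the real 250-column DataFrame (names assumed from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real DataFrame; column names are assumed.
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["col1", "col2"])

# Keep the data in memory (spilling to disk by default) so later
# queries against df do not recompute it.
df = df.persist()

# Each selectExpr argument is parsed as one SQL expression;
# "*" keeps all existing columns.
df = df.selectExpr("*", "113.4*col1 + 41.4*col2 as cal_col")
df.show()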

Link: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.selectExpr
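Another option along the same lines, as a sketch: pyspark.sql.functions.expr parses a SQL expression string, so the formula can stay one readable string inside withColumn:

from pyspark.sql import functions as F

# Same calculation without F.col; expr() parses the SQL string.
df = df.withColumn("cal_col", F.expr("113.4*col1 + 41.4*col2"))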


5 Comments

What is the cost of createOrReplaceTempView? I thought it just creates a logical name to be used as a table, without materializing anything. persist, on the other hand, materializes the whole table in Spark memory.
You are correct, a temp view is just a lazily evaluated view and is not in memory. If you want the view's data in memory, the df used to create the view needs to be cached. I thought you wanted to work in memory with selectExpr, which is why I gave the persist suggestion.
No, I am not looking to cache anything in memory. Right now the only reason I use a temp view is to be able to write SQL-like queries, not to keep something in memory; hence the question about avoiding it and using native PySpark syntax, in case the temp view carries a cost. It turns out it does not cost much, as you said and as I thought. So a temp view is still a cheap option; it just requires an extra line of code.
Another related question: how rich is selectExpr? Does it support a more complicated query such as .selectExpr("*, (113.4*col1 + 41.4*col2...) as cal_col, case when col3 < 0 then 0 else 1 end as col3_indic where col_filter in (1,10) group by ... having ... order by ...")?
I mostly use expressions in PySpark for dealing with complex data types, higher-order functions, or nested data. Expressions such as the case when should work, but selectExpr parses each argument as a single column expression, so whole-query clauses like where, group by, having, or order by will not; those belong to the DataFrame API (filter, groupBy, orderBy) or to spark.sql (see the sketch below). I would still recommend the DataFrame API, as it is more readable and understandable for people without a database/SQL background.
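For reference, a sketch of where that boundary sits (column names col1, col2, col3, and col_filter taken from the comment above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0, -1, 1), (3.0, 4.0, 5, 10)],
    ["col1", "col2", "col3", "col_filter"],
)

# Column expressions, including case when, are fine in selectExpr:
out = df.selectExpr(
    "*",
    "113.4*col1 + 41.4*col2 as cal_col",
    "case when col3 < 0 then 0 else 1 end as col3_indic",
)

# Whole-query clauses are not; use DataFrame methods instead...
out = out.filter("col_filter in (1, 10)")

# ...or a temp view plus spark.sql for the full query:
df.createOrReplaceTempView("df_view")
out2 = spark.sql(
    "select *, case when col3 < 0 then 0 else 1 end as col3_indic "
    "from df_view where col_filter in (1, 10) order by col1"
)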
