I'd like to use SQL-like syntax on a Spark DataFrame df.
Let's say I need a calculated column:
cal_col = 113.4*col1 + 41.4*col2 + ...
What I do at the moment is one of the following:
1/ Registering the DataFrame as a temp view:
df.createOrReplaceTempView("df_view")
df = spark.sql("select *, 113.4*col1 + 41.4*col2 + ... AS cal_col from df_view")
Question: Is there a lot of overhead in registering a big df as a temp view? If so, at what point does it stop making sense? Let's say df has 250 columns and 15 million records.
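For reference, a minimal self-contained sketch of approach 1; the SparkSession setup and the two-column sample data are placeholders for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["col1", "col2"])

# register the DataFrame under a name so it can be referenced from SQL
df.createOrReplaceTempView("df_view")
df = spark.sql("select *, 113.4*col1 + 41.4*col2 as cal_col from df_view")
df.show()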
2/ PySpark DataFrame syntax, which is a bit harder to read and requires translating the formula:
df = df.withColumn("cal_col", 113.4*F.col("col1") + 41.4*F.col("col2") + ...)
A long formula quickly becomes difficult to read this way.
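Here is the same thing as a runnable sketch, including the import that the F.col style needs (sample data as above, for illustration only):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["col1", "col2"])

# every column reference has to be wrapped in F.col, which is what
# makes a long formula awkward to read
df = df.withColumn("cal_col", 113.4 * F.col("col1") + 41.4 * F.col("col2"))
df.show()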
Question: Is there a way to write the expression in SQL-like syntax, without F.col?
Something along the lines of:
df = df.select("*, (113.4*col1 + 41.4*col2 + ...) as cal_col")
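To make the intent concrete: I believe selectExpr and F.expr accept SQL expression strings, so presumably something like the sketch below works, but I'd like to confirm that this is the idiomatic route and what it costs on a wide DataFrame (column names are placeholders):

# selectExpr takes SQL expression strings directly, no F.col needed
df = df.selectExpr("*", "113.4*col1 + 41.4*col2 as cal_col")

# F.expr wraps a single SQL expression for use inside withColumn
df = df.withColumn("cal_col", F.expr("113.4*col1 + 41.4*col2"))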