Let us consider the PySpark SQL module. I'd like to know if there is a difference, in terms of performance or other potential indicators, between running a query through the SQL-like DataFrame API and running it as an embedded SQL string.
A typical example, with a SELECT query, would be
df.where(df.col == 'value')
as opposed to
sqlc.sql("SELECT * FROM df WHERE col = 'value'")
where sqlc is the SQLContext in PySpark.
If there are no differences, where does the possibility to use both syntaxes come from?
You can use timeit to profile the difference. .explain() will show you that it is exactly the same.
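As a rough sketch of that check, the snippet below builds the same filter both ways and captures what explain() prints for each, so the two plans can be compared side by side. Several things here are assumptions rather than part of the question: a local SparkSession stands in for sqlc, the two-row DataFrame is made up, the DataFrame is registered as a temporary view named df so the SQL string can find it, and plan_text and compare_plans are hypothetical helper names. It also relies on explain() printing from the Python side, which is how recent PySpark versions behave as far as I know.

```python
import contextlib
import io


def plan_text(df):
    """Return whatever df.explain() prints, as a string."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def compare_plans():
    """Build the same filter via the DataFrame API and via SQL; return both plans."""
    # Imported here so plan_text stays usable without Spark installed.
    from pyspark.sql import SparkSession

    # Assumption: a local session replaces the question's SQLContext (sqlc).
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("plan-compare")
        .getOrCreate()
    )

    # Made-up sample data with a column named "col", as in the question.
    df = spark.createDataFrame([("value", 1), ("other", 2)], ["col", "n"])
    # The SQL string can only refer to "df" once it is registered as a view.
    df.createOrReplaceTempView("df")

    api_plan = plan_text(df.where(df["col"] == "value"))
    sql_plan = plan_text(spark.sql("SELECT * FROM df WHERE col = 'value'"))

    spark.stop()
    return api_plan, sql_plan
```

Calling api_plan, sql_plan = compare_plans() and printing both strings lets you read the two physical plans next to each other. For timing, wrapping an action such as .count() on each version in timeit.timeit(..., number=10) gives a rough comparison, though on a tiny local dataset the numbers will mostly reflect scheduling overhead.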