Let us consider the PySpark SQL module. I'd like to know whether there is any difference, in performance or in any other respect, between expressing a query through the SQL-like DataFrame API and embedding an SQL statement directly.

A typical example, with a SELECT query, would be

df.where(df.col == 'value')

as opposed to

sqlc.sql("SELECT * FROM df WHERE col = 'value'")

where sqlc is the SQLContext in PySpark.

If there are no differences, why are both syntaxes available?

  • With the first you don't have to write SQL (although of course you need to know it) Commented Mar 18, 2016 at 16:07
  • You could use timeit to profile the difference. Commented Mar 18, 2016 at 16:08
  • .explain() will show you that it is exactly the same. Commented Mar 18, 2016 at 16:25
