
To get good performance with Spark, I'm wondering whether it is better to run SQL queries via SQLContext, or to express queries via DataFrame functions like df.select().

Any idea? :)

4 Answers


There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.

  • Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety.

  • Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any modifications in every supported language. With HiveContext, they can also be used to expose some functionality that may be inaccessible in other ways (for example, UDFs without Spark wrappers). A quick way to verify the equivalence of the two styles is sketched right after this list.
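
A simple way to convince yourself of this is to compare the physical plans of the same query written both ways. This is only a minimal sketch, assuming the newer SparkSession entry point (rather than a bare SQLContext) and made-up column names:

    // Minimal sketch: the same aggregation written as SQL and as DataFrame calls.
    // The SparkSession entry point and the "key"/"value" columns are assumptions.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("sql-vs-df").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
    df.createOrReplaceTempView("t")

    // The same query expressed both ways.
    val viaSql = spark.sql("SELECT key, SUM(value) AS total FROM t GROUP BY key")
    val viaDf  = df.groupBy("key").agg(sum("value").as("total"))

    // Both should print the same physical plan, since Catalyst optimizes both.
    viaSql.explain()
    viaDf.explain()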


1 Comment

Would you give the same answer if the question were about SQL ORDER BY vs the Spark orderBy method? Thanks.

By using the DataFrame API, one can break the SQL into multiple statements/queries, which helps with debugging, easy enhancements, and code maintenance.

Breaking complex SQL queries into simpler queries and assigning each result to a DataFrame makes them easier to understand.

By splitting the query into multiple DataFrames, the developer gains the advantage of caching and repartitioning (to distribute the data evenly across partitions using a unique or close-to-unique key), as sketched below.
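
For example, a minimal sketch of that pattern (the input paths, table names, and columns are hypothetical):

    // Split one large query into intermediate DataFrames, with repartitioning
    // and caching at the intermediate step. Paths and column names are made up.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("split-query").master("local[*]").getOrCreate()

    val orders    = spark.read.parquet("/data/orders")      // assumed input
    val customers = spark.read.parquet("/data/customers")   // assumed input

    // Step 1: filter early, repartition on a close-to-unique key, and cache,
    // because the intermediate result is reused below.
    val recentOrders = orders
      .filter(col("order_date") >= "2020-01-01")
      .repartition(col("customer_id"))
      .cache()

    // Step 2: further, independently debuggable steps over the cached result.
    val totals   = recentOrders.groupBy("customer_id").agg(sum("amount").as("total"))
    val enriched = totals.join(customers, Seq("customer_id"))

    enriched.show()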

1 Comment

Can we cache data at an intermediate level when we have a Spark SQL query? Because we can easily do it by splitting the query into many parts when using the DataFrame API.

Ideally, Spark's Catalyst optimizer should compile both calls to the same execution plan, so the performance should be the same; which one you use is just a matter of style. In reality, there is a difference according to the report by Hortonworks (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html), where SQL outperforms DataFrames in a case where you need grouped records with their total counts, sorted descending by record name.
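
For reference, that kind of query looks roughly like this in both styles (a sketch only; the events table and name column are invented here):

    // The grouped-count-sorted-descending query expressed both ways.
    // The "events" table and "name" column are assumptions for illustration.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark  = SparkSession.builder().appName("group-count").master("local[*]").getOrCreate()
    val events = spark.read.parquet("/data/events")   // assumed input
    events.createOrReplaceTempView("events")

    // SQL flavour.
    val viaSql = spark.sql(
      "SELECT name, COUNT(*) AS cnt FROM events GROUP BY name ORDER BY name DESC")

    // DataFrame flavour of the same query.
    val viaDf = events.groupBy("name").agg(count(lit(1)).as("cnt")).orderBy(desc("name"))

    viaSql.explain()
    viaDf.explain()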



The only thing that matters is what kind of underlying algorithm is used for grouping. HashAggregation is more efficient than SortAggregation:

  • SortAggregation sorts the rows and then gathers together the matching rows: O(n log n).

  • HashAggregation builds a hash map using the grouping columns as the key and the remaining columns as the values: O(n).

Spark SQL uses HashAggregation where possible (i.e. when the aggregation buffer values are of mutable types).
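
If you want to check which strategy was picked, the physical plan shows it. A minimal sketch (column names are made up; the exact fallback node can vary by Spark version):

    // Look for HashAggregate vs SortAggregate in the printed physical plan.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("agg-strategy").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)).toDF("key", "value")

    // sum over a numeric column has a mutable, fixed-width buffer -> HashAggregate.
    df.groupBy("key").agg(sum("value")).explain()

    // collect_list has an array-typed buffer, so it cannot use the fixed-width
    // hash aggregation; depending on the Spark version the plan shows
    // SortAggregate or ObjectHashAggregate instead.
    df.groupBy("key").agg(collect_list("value")).explain()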

1 Comment

Can we cache data at an intermediate level when we have a Spark SQL query? Because we can easily do it by splitting the query into many parts when using the DataFrame API.
