
To get good performance with Spark, I'm wondering whether it is better to run SQL queries via SQLContext, or to express queries via DataFrame functions like df.select().

Any idea? :)

4 Answers


There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.

  • Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety.

  • Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any modifications in every supported language. With HiveContext, they can also be used to expose some functionality that may be inaccessible in other ways (for example, UDFs without Spark wrappers). A quick way to verify the equivalence of the two styles is sketched right after this list.
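
A simple way to convince yourself of this is to compare the physical plans of the same query written both ways. This is only a minimal sketch, assuming the newer SparkSession entry point (rather than a bare SQLContext) and made-up column names:

    // Minimal sketch: the same aggregation written as SQL and as DataFrame calls.
    // The SparkSession entry point and the "key"/"value" columns are assumptions.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("sql-vs-df").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
    df.createOrReplaceTempView("t")

    // The same query expressed both ways.
    val viaSql = spark.sql("SELECT key, SUM(value) AS total FROM t GROUP BY key")
    val viaDf  = df.groupBy("key").agg(sum("value").as("total"))

    // Both should print the same physical plan, since Catalyst optimizes both.
    viaSql.explain()
    viaDf.explain()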


1 Comment

Would you give the same answer if the question were about SQL ORDER BY vs the Spark orderBy method? Thanks.

By using the DataFrame API, one can break the SQL into multiple statements/queries, which helps with debugging, easy enhancements, and code maintenance.

Breaking complex SQL queries into simpler queries and assigning each result to a DataFrame makes them easier to understand.

By splitting the query into multiple DataFrames, the developer gains the advantage of caching and repartitioning (to distribute the data evenly across partitions using a unique or close-to-unique key), as sketched below.
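
For example, a minimal sketch of that pattern (the input paths, table names, and columns are hypothetical):

    // Split one large query into intermediate DataFrames, with repartitioning
    // and caching at the intermediate step. Paths and column names are made up.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("split-query").master("local[*]").getOrCreate()

    val orders    = spark.read.parquet("/data/orders")      // assumed input
    val customers = spark.read.parquet("/data/customers")   // assumed input

    // Step 1: filter early, repartition on a close-to-unique key, and cache,
    // because the intermediate result is reused below.
    val recentOrders = orders
      .filter(col("order_date") >= "2020-01-01")
      .repartition(col("customer_id"))
      .cache()

    // Step 2: further, independently debuggable steps over the cached result.
    val totals   = recentOrders.groupBy("customer_id").agg(sum("amount").as("total"))
    val enriched = totals.join(customers, Seq("customer_id"))

    enriched.show()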

1 Comment

Can we cache data at an intermediate level when we have a Spark SQL query? Because we can easily do it by splitting the query into many parts when using the DataFrame API.

Ideally, Spark's Catalyst optimizer should compile both calls to the same execution plan, so the performance should be the same; which one you use is just a matter of style. In reality, there is a difference according to the report by Hortonworks (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html), where SQL outperforms DataFrames in a case where you need grouped records with their total counts, sorted descending by record name.
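
For reference, that kind of query looks roughly like this in both styles (a sketch only; the events table and name column are invented here):

    // The grouped-count-sorted-descending query expressed both ways.
    // The "events" table and "name" column are assumptions for illustration.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark  = SparkSession.builder().appName("group-count").master("local[*]").getOrCreate()
    val events = spark.read.parquet("/data/events")   // assumed input
    events.createOrReplaceTempView("events")

    // SQL flavour.
    val viaSql = spark.sql(
      "SELECT name, COUNT(*) AS cnt FROM events GROUP BY name ORDER BY name DESC")

    // DataFrame flavour of the same query.
    val viaDf = events.groupBy("name").agg(count(lit(1)).as("cnt")).orderBy(desc("name"))

    viaSql.explain()
    viaDf.explain()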



The only thing that matters is what kind of underlying algorithm is used for grouping. HashAggregation is more efficient than SortAggregation:

  • SortAggregation sorts the rows and then gathers together the matching rows: O(n log n).

  • HashAggregation builds a hash map using the grouping columns as the key and the remaining columns as the values: O(n).

Spark SQL uses HashAggregation where possible (i.e. when the aggregation buffer values are of mutable types).
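
If you want to check which strategy was picked, the physical plan shows it. A minimal sketch (column names are made up; the exact fallback node can vary by Spark version):

    // Look for HashAggregate vs SortAggregate in the printed physical plan.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("agg-strategy").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)).toDF("key", "value")

    // sum over a numeric column has a mutable, fixed-width buffer -> HashAggregate.
    df.groupBy("key").agg(sum("value")).explain()

    // collect_list has an array-typed buffer, so it cannot use the fixed-width
    // hash aggregation; depending on the Spark version the plan shows
    // SortAggregate or ObjectHashAggregate instead.
    df.groupBy("key").agg(collect_list("value")).explain()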

1 Comment

Can we cache data at an intermediate level when we have a Spark SQL query? Because we can easily do it by splitting the query into many parts when using the DataFrame API.
