Writing SQL vs using Dataframe APIs in Spark SQL

Question

I am new to Spark SQL. I am currently migrating my application's ingestion code, which loads data into the stage, raw and application layers in HDFS and performs CDC (change data capture). It is currently written as Hive queries and executed via Oozie, and it needs to move to a Spark application (current version 1.6). The rest of the code will be migrated later.

In Spark SQL, I can create DataFrames directly from tables in Hive and simply execute the queries as they are (like sqlContext.sql("my hive hql")). The other way would be to use the DataFrame API and rewrite the HQL that way.

What is the difference between these two approaches?

Is there any performance gain from using the DataFrame API?

Some people suggested that there is an extra SQL layer the Spark core engine has to go through when using "SQL" queries directly, which may affect performance to some extent, but I didn't find any material substantiating that statement. I know the code would be much more compact with the DataFrame API, but when I already have all my HQL queries at hand, is it really worth rewriting everything with the DataFrame API?

Thank You.

Please check my answer. Moreover, DataFrames use the Tungsten memory representation, and the Catalyst optimizer is used by SQL as well as by DataFrames.
– Ram Ghadiyaram, Commented Aug 1, 2017 at 15:34
One more thing to note: with the Dataset API, you have more control over the actual execution plan than with Spark SQL.
– Ross Brigoli, Commented Nov 21, 2019 at 3:53

Ram Ghadiyaram (edited by Leighton Ritchie) · Accepted Answer · 2020-01-09 16:56:49Z

33

Question: What is the difference between these two approaches? Is there any performance gain from using the DataFrame API?

Answer:

There is a comparative study done by Hortonworks. source...

The gist is that each approach is the right one in some situation or scenario; there is no hard and fast rule for deciding. Please go through the summary below.

RDDs, DataFrames, and SparkSQL (in fact 3 approaches, not just 2):

At its core, Spark operates on the concept of Resilient Distributed Datasets, or RDDs:

Resilient - if data in memory is lost, it can be recreated
Distributed - immutable distributed collection of objects in memory partitioned across many data nodes in a cluster
Dataset - initial data can come from files, be created programmatically, from data in memory, or from another RDD

The DataFrames API is a data abstraction framework that organizes your data into named columns:

Creates a schema for the data
Conceptually equivalent to a table in a relational database
Can be constructed from many sources including structured data files, tables in Hive, external databases, or existing RDDs
Provides a relational view of the data for easy SQL-like data manipulations and aggregations
Under the hood, it is an RDD of Row objects

SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL through:

SQL
DataFrames API
Datasets API

Test results:

RDDs outperformed DataFrames and SparkSQL for certain types of data processing
DataFrames and SparkSQL performed about the same, although with analysis involving aggregation and sorting SparkSQL had a slight advantage
Syntactically speaking, DataFrames and SparkSQL are much more intuitive than using RDDs
Took the best out of 3 for each test
Times were consistent and not much variation between tests
Jobs were run individually with no other jobs running

[Chart: random lookup against 1 order ID from 9 million unique order IDs]
[Chart: GROUP all the different products with their total COUNTS and SORT DESCENDING by product name]

edited Jan 9, 2020 at 16:56 by Leighton Ritchie

answered Aug 1, 2017 at 13:12 by Ram Ghadiyaram

3 Comments

Den R Over a year ago

This study is relevant for Spark 1.6. Spark 2.3 has Tungsten and does a lot of optimization via codegen and column-based internal storage, so results could be many times faster than those of Spark 1.6.

akash patel Over a year ago

Can we cache data at an intermediate level when we have a Spark SQL query? With the DataFrame API we can easily do it by splitting the query into several parts.

Hanan Shteingart Over a year ago

Sad results above. 3 times slower for the DF interface?

Arun Sharma · Answer · 2018-05-28 21:07:50Z

27

With Spark SQL string queries, you won't learn of a syntax error until runtime (which could be costly), whereas with the DataFrame API, syntax errors can be caught at compile time.

answered May 28, 2018 at 21:07 by Arun Sharma

2 Comments

MAC Over a year ago

You can use printSchema() to catch syntax errors during lazy evaluation in Spark SQL. If the schema prints, that means there are no syntax errors.

jonathanChap Over a year ago

Not true. For example, you may mistype a column name or another literal, so the code passes the Scala type system (i.e. it compiles) but still fails at runtime.

Blue Clouds · Answer · 2021-01-16 11:55:59Z

3

A couple more additions. DataFrames use the Tungsten memory representation, and the Catalyst optimizer is used by SQL as well as by DataFrames. With the Dataset API, you have more control over the actual execution plan than with SparkSQL.

answered Jan 16, 2021 at 11:55 by Blue Clouds

1 Comment

PHPirate Over a year ago

This answer just copied the comments on the question

G.S.Tomar · Answer · 2019-09-09 12:43:59Z

2

If the query is lengthy, then efficient writing & running query, shall not be possible. On the other hand, the DataFrame API, along with the Column API, helps developers write compact code, which is ideal for ETL applications.

Also, all operations (e.g. greater than, less than, select, where) run through the DataFrame API build an abstract syntax tree (AST), which is then passed to Catalyst for further optimization. (Source: Spark SQL Whitepaper, Section 3.3)
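To make the AST point concrete, here is a toy sketch in plain Python. It is not Spark code and not how Catalyst is implemented; the classes (Scan, Filter, Project, Frame), the helpers (table, sql) and the orders table with its columns are all hypothetical names invented for illustration. It only shows the idea that a SQL string and a chain of API calls are two front ends that can produce the same tree, which is what an optimizer then works on.

```python
import re
from dataclasses import dataclass


# Toy logical-plan nodes (hypothetical; Spark's real plan classes differ).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: object
    column: str
    op: str
    value: int


@dataclass(frozen=True)
class Project:
    child: object
    columns: tuple


class Frame:
    """Toy stand-in for a DataFrame: each call only grows the plan tree."""

    def __init__(self, plan):
        self.plan = plan

    def where(self, column, op, value):
        return Frame(Filter(self.plan, column, op, value))

    def select(self, *columns):
        return Frame(Project(self.plan, tuple(columns)))


def table(name):
    """API front end: start a plan from a table scan."""
    return Frame(Scan(name))


# Only handles: SELECT <cols> FROM <table> WHERE <col> <op> <integer>
_QUERY = re.compile(
    r"^\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)"
    r"\s+WHERE\s+(?P<col>\w+)\s*(?P<op>>=|<=|>|<|=)\s*(?P<val>\d+)\s*$",
    re.IGNORECASE,
)


def sql(text):
    """SQL front end: parse the string into the same kind of tree."""
    match = _QUERY.match(text)
    if match is None:
        # A malformed string is only detected here, when it is parsed.
        raise ValueError("cannot parse query: " + text)
    columns = tuple(c.strip() for c in match.group("cols").split(","))
    filtered = Filter(
        Scan(match.group("table")),
        match.group("col"),
        match.group("op"),
        int(match.group("val")),
    )
    return Frame(Project(filtered, columns))


from_sql = sql("SELECT order_id, product FROM orders WHERE quantity > 10")
from_api = table("orders").where("quantity", ">", 10).select("order_id", "product")

# Both front ends end up with an identical tree.
print(from_sql.plan == from_api.plan)
```

In real Spark you would compare the two forms by calling explain() on the DataFrame returned by sqlContext.sql(...) and on the one built through the API, rather than by comparing objects like this.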

edited Sep 9, 2019 at 12:43

answered Sep 6, 2019 at 7:42 by G.S.Tomar

3 Comments

Vikrant Singh Rana Over a year ago

What do you mean by "efficient writing & running query, shall not be possible"?

G.S.Tomar Over a year ago

Comparatively fewer chances of syntax/semantic errors while authoring queries. If you have authored queries in JDBC vs. the Hibernate Criteria API, you will understand the intent very well.

akash patel Over a year ago

@G.S.Tomar Can we cache data at an intermediate level when we have a Spark SQL query? With the DataFrame API we can easily do it by splitting the query into several parts.
