I'm learning Databricks with a friend, and there is one thing I really don't get.

I'm trying to query a JSON file that sits in an Azure storage account, using PySpark and Spark SQL.

The path of the file in Azure is this: 'abfss://[email protected]/raw_files/'

In Databricks I've created the following statement to create the DataFrame:

df = spark.read.format("json").load("abfss://[email protected]/raw_files/")

OK so far, but:

Knowing that I've created a DataFrame, why can't I query it using PySpark or Spark SQL?

If I use this statement, just as an example:

SELECT * FROM df

It will not work.

However, when I do this it works:

df = spark.read.format("json").load("abfss://[email protected]/raw_files/")
df.createOrReplaceTempView('df_view')

SELECT * FROM df_view;

My friend said this happens because PySpark and Spark SQL are APIs (my doubt lies here).

Why does this happen? And what other ways are there besides createOrReplaceTempView?

Could someone give some advice?

  • A dataframe is not a SQL object. Creating a view gives you a SQL interface to the dataframe. Commented Nov 13, 2024 at 14:40
  • Could you explain what "SQL object" means? If I specify the path without creating a view, will it be a problem? Commented Nov 13, 2024 at 16:04

2 Answers


Have a look at this page from the Azure Databricks documentation. In the SQL area, you see the following example, translated to your path:

SELECT * FROM json.`abfss://[email protected]/raw_files/`

I think that's a nice alternative in your case if you prefer using SQL.
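
If you prefer to stay in a Python cell, the same path-based query can be run through spark.sql(). A minimal sketch, assuming the path and storage access are already configured:

# Query the JSON files directly by path; no view or table is required
result = spark.sql(
    "SELECT * FROM json.`abfss://[email protected]/raw_files/`"
)
result.show()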


3 Comments

Wow, that's interesting. I thought this only worked with Delta files. Do I need to create a view, or is specifying the path like this enough to query using PySpark or Spark SQL?
Pretty sure this requires Unity Catalog.
I believe it might be correct that you need Unity Catalog. If you do not have that, you can use the COPY INTO command as specified in this link. To do that, you do need to set some Spark configuration, as the page describes. Documentation of how to do this for JSON can be found here. In this scenario you do need to create a table; with the Unity Catalog option you do not need to create a view or table.
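
For reference, a rough sketch of the COPY INTO route from a notebook (the target table name raw_json_table is just an example, and the required storage credentials/Spark configuration are omitted):

# Hypothetical target table; COPY INTO then loads the JSON files into it
spark.sql("CREATE TABLE IF NOT EXISTS raw_json_table")
spark.sql("""
    COPY INTO raw_json_table
    FROM 'abfss://[email protected]/raw_files/'
    FILEFORMAT = JSON
    COPY_OPTIONS ('mergeSchema' = 'true')
""")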

Your friend is correct: the Spark API is the DataFrame (a Dataset of Rows). Using createOrReplaceTempView enables the SQL interface, but (with only a few exceptions) the SQL is parsed into the same expression trees and query plans as the DataFrame API.
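
You can see that equivalence by comparing the plans yourself. A small sketch using the df from the question:

# Both calls compile down to the same Catalyst expression trees and query plans
df.createOrReplaceTempView('df_view')
df.select("*").explain()
spark.sql("SELECT * FROM df_view").explain()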

In order to be able to just use SQL, you have to register that dataset on ADLS as an external table via CREATE TABLE ... USING ... with a LOCATION clause (and make sure Databricks can access it, etc.).
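
A minimal sketch of that registration (the table name raw_files_json is hypothetical, and valid storage access is still required):

# Register the JSON files in ADLS as an external table, then query it with plain SQL
spark.sql("""
    CREATE TABLE IF NOT EXISTS raw_files_json
    USING JSON
    LOCATION 'abfss://[email protected]/raw_files/'
""")
spark.sql("SELECT * FROM raw_files_json").show()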

1 Comment

I'm trying, first of all, to access the values at the raw level and work with them (it's like a homework assignment), and I need to do it using PySpark and Spark SQL.
