I'm learning Databricks with a friend, and there is one thing I really don't get.

I'm trying to query a JSON file that sits in an Azure storage account, using PySpark and Spark SQL.

The path of the file in Azure is this: 'abfss://[email protected]/raw_files/'

In Databricks I've created the following statement to create the DataFrame:

df = spark.read.format("json").load("abfss://[email protected]/raw_files/")

OK so far, but:

Knowing that I've created a DataFrame, why can't I query it using PySpark or Spark SQL?

If I use this statement, just as an example:

SELECT * FROM df

It will not work.

However, when I do this it works:

df = spark.read.format("json").load("abfss://[email protected]/raw_files/")
df.createOrReplaceTempView('df_view')

SELECT * FROM df_view;

My friend said this happens because PySpark and Spark SQL are APIs (my doubt lies here).

Why does this happen? And what other ways are there besides createOrReplaceTempView?

Could someone give some advice?

  • A dataframe is not a SQL object. Creating a view gives you a SQL interface to the dataframe. Commented Nov 13, 2024 at 14:40
  • Could you explain what "SQL object" means? If I specify the path without creating a view, will it be a problem? Commented Nov 13, 2024 at 16:04

2 Answers


Have a look at this page from the Azure Databricks documentation. In the SQL area, you see the following example, translated to your path:

SELECT * FROM json.`abfss://[email protected]/raw_files/`

I think that's a nice alternative in your case if you prefer using SQL.
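
If you prefer to stay in a Python cell, the same path-based query can be run through spark.sql(). A minimal sketch, assuming the path and storage access are already configured:

# Query the JSON files directly by path; no view or table is required
result = spark.sql(
    "SELECT * FROM json.`abfss://[email protected]/raw_files/`"
)
result.show()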


3 Comments

Wow, that's interesting. I thought this only worked with Delta files. Do I need to create a view, or is specifying the path like this enough to query using PySpark or Spark SQL?
Pretty sure this requires Unity Catalog.
I believe it might be correct that you need Unity Catalog. If you do not have that, you can use the COPY INTO command as specified in this link. To do that, you do need to set some Spark configuration, as the page describes. Documentation of how to do this for JSON can be found here. In this scenario you do need to create a table; with the Unity Catalog option you do not need to create a view or table.
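
For reference, a rough sketch of the COPY INTO route from a notebook (the target table name raw_json_table is just an example, and the required storage credentials/Spark configuration are omitted):

# Hypothetical target table; COPY INTO then loads the JSON files into it
spark.sql("CREATE TABLE IF NOT EXISTS raw_json_table")
spark.sql("""
    COPY INTO raw_json_table
    FROM 'abfss://[email protected]/raw_files/'
    FILEFORMAT = JSON
    COPY_OPTIONS ('mergeSchema' = 'true')
""")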

Your friend is correct: the Spark API is the DataFrame (a Dataset of Rows). Using createOrReplaceTempView enables the SQL interface, but (with only a few exceptions) the SQL is parsed into the same expression trees and query plans as the DataFrame API.
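
You can see that equivalence by comparing the plans yourself. A small sketch using the df from the question:

# Both calls compile down to the same Catalyst expression trees and query plans
df.createOrReplaceTempView('df_view')
df.select("*").explain()
spark.sql("SELECT * FROM df_view").explain()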

In order to be able to just use SQL, you have to register that dataset on ADLS as an external table via CREATE TABLE ... USING ... with a LOCATION clause (and make sure Databricks can access it, etc.).
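
A minimal sketch of that registration (the table name raw_files_json is hypothetical, and valid storage access is still required):

# Register the JSON files in ADLS as an external table, then query it with plain SQL
spark.sql("""
    CREATE TABLE IF NOT EXISTS raw_files_json
    USING JSON
    LOCATION 'abfss://[email protected]/raw_files/'
""")
spark.sql("SELECT * FROM raw_files_json").show()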

1 Comment

I'm trying, first of all, to access the values at the raw level and work with them (it's like a homework assignment), and I need to do it using PySpark and Spark SQL.
