
I would like to run a SQL query on a dataframe, but do I have to create a view on this dataframe first? Is there an easier way?

df = spark.createDataFrame([
    ('a', 1, 1), ('a', 1, None), ('b', 1, 1),
    ('c', 1, None), ('d', None, 1), ('e', 1, 1)
]).toDF('id', 'foo', 'bar')

I want to run some complex queries against this dataframe. For example, I can do

df.createOrReplaceTempView("temp_view")
df_new = spark.sql("select id, max(foo) from temp_view group by id")

but do I have to convert it to a view before querying it? I know there are DataFrame-equivalent operations. The above query is only an example.

1 Answer


You can just do

df.select('id', 'foo')

This will return a new Spark DataFrame with columns id and foo.


8 Comments

Thank you. I would like to run a more complex query against this table. The above is just an example.
Could you elaborate on the exact operation(s) you're looking to perform?
Besides converting the dataframe to a view and then querying it, is there a different way to run a SQL query against this dataframe?
You can explore using expr from pyspark.sql.functions. More info can be found in the docs: spark.apache.org/docs/latest/api/python/reference/api/…
Thanks. I have also seen in the Spark documentation that spark.sql is used interchangeably with pyspark.sql; are they the same? spark.apache.org/docs/latest/api/python/reference/api/…
