
I want to convert the following SQL statement into a PySpark DataFrame select statement:

Select 
 YY,
 PP,
 YYYY,
 PPPP,
 Min(ID) as MinId, 
 Max(ID) as MaxID 
from LoadTable

I have tried the following, but it doesn't seem to work:

df.select(df.ID, df.YY, df.PP, df.YYYY, df.PPPPP).agg({"ID": "max", "ID": "min"}).toPandas().to_csv(outputFile, sep="|", header=True, index=False)
Comment: Did you check each component of the statement? Can you provide the errors you see? There are a number of things you need to consider before posting here. – Oct 21, 2016 at 18:42

1 Answer


As you are performing aggregate functions, what you may be missing here is the GROUP BY clause. If so, your SQL statement would be:

SELECT YY, PP, YYYY, PPPP, Min(ID) as MinId, Max(ID) as MaxID 
  FROM LoadTable 
 GROUP BY YY, PP, YYYY, PPPP
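
Alternatively, you can run that SQL directly if you register the DataFrame as a temporary view (a sketch assuming Spark 2.0+ with a SparkSession named spark; the view name simply mirrors your table name):

# Expose df to SQL under the name used in the query, then run it as-is
df.createOrReplaceTempView("LoadTable")
result = spark.sql("""
    SELECT YY, PP, YYYY, PPPP, MIN(ID) AS MinId, MAX(ID) AS MaxID
      FROM LoadTable
     GROUP BY YY, PP, YYYY, PPPP
""")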

The corresponding statement using the PySpark DataFrame API would then be:

from pyspark.sql import functions as F
# The aliases match the MinId/MaxID column names from the SQL
df.groupBy(df.YY, df.PP, df.YYYY, df.PPPP).agg(F.min(df.ID).alias("MinId"), F.max(df.ID).alias("MaxID"))
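
As an aside, one reason the original .agg({"ID": "max", "ID": "min"}) call misbehaves is that a Python dict literal keeps only the last value for a duplicate key, so only the min aggregation is ever requested; the column expressions above sidestep that. Here is a minimal end-to-end sketch (the SparkSession setup, sample rows, and output path are illustrative assumptions; the pipe-delimited export mirrors your snippet):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("loadtable-example").getOrCreate()

# Hypothetical sample rows standing in for LoadTable
df = spark.createDataFrame(
    [(16, 1, 2016, 201601, 100),
     (16, 1, 2016, 201601, 250),
     (16, 2, 2016, 201602, 300)],
    ["YY", "PP", "YYYY", "PPPP", "ID"],
)

# Group by the four key columns and compute both aggregates in one pass
result = df.groupBy(df.YY, df.PP, df.YYYY, df.PPPP).agg(
    F.min(df.ID).alias("MinId"),
    F.max(df.ID).alias("MaxID"),
)

# Pipe-delimited CSV export, as in the original attempt
outputFile = "load_table_minmax.csv"  # hypothetical output path
result.toPandas().to_csv(outputFile, sep="|", header=True, index=False)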

HTH!

