
I want to convert the query below into Spark DataFrame operations:

sqlContext.sql("SELECT d.dep_name,count(*) FROM employees e,department d WHERE e.dep_id = d.dep_id GROUP BY d.dep_name HAVING count(*) >= 2").show  

Output:

+---------+---+                                                                 
| dep_name|_c1|
+---------+---+
|  FINANCE|  3|
|    AUDIT|  5|
|MARKETING|  6|

I tried it using below query:

scala> finalEmployeesDf.as("df1").join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id").select($"dep_name").groupBy($"dep_name").count.show()
+---------+-----+                                                               
| dep_name|count|
+---------+-----+
|  FINANCE|    3|
|    AUDIT|    5|
|MARKETING|    6|
+---------+-----+  

I know this isn't correct because, in a case where a department has only a single entry, it would also be listed in these results, but I want rows displayed only when the count is at least 2, like the `HAVING count(*) >= 2` part of the SQL query. How can I achieve this? I tried googling, but found no help for this case.

  • There is no performance difference between the queries and dataframe operation, so why would you need to do this? Commented Aug 19, 2018 at 16:20
  • I am just learning from certification perspective @cricket_007 Commented Aug 25, 2018 at 5:31

2 Answers


You have the group and aggregate parts wrong. You need to select the relevant columns, group by them, and aggregate on the relevant ones. Here is untested code representing the correct approach:

import org.apache.spark.sql.functions.count  // required, or count() is not found

finalEmployeesDf.as("df1")
  .join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id")
  .select($"dep_name")
  .groupBy($"dep_name")
  .agg(count($"dep_name").as("cnt"))
  .filter($"cnt" >= 2)  // equivalent of HAVING count(*) >= 2
  .show()

A general suggestion: break chained API calls across several lines; it makes the code much easier to read and understand.
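The pattern can be sketched end to end with hypothetical sample data (the schema, values, and SparkSession setup below are assumptions for illustration, not from the question; on Spark 1.6 you would build the DataFrames from a SQLContext instead):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().master("local[*]").appName("having-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data mirroring the question's tables
val employees   = Seq((1, 10), (2, 10), (3, 10), (4, 20)).toDF("emp_id", "dep_id")
val departments = Seq((10, "FINANCE"), (20, "AUDIT")).toDF("dep_id", "dep_name")

employees.as("e")
  .join(departments.as("d"), $"e.dep_id" === $"d.dep_id")
  .groupBy($"d.dep_name")
  .agg(count("*").as("cnt"))
  .filter($"cnt" >= 2)   // plays the role of HAVING count(*) >= 2
  .show()
```

With this sample data, AUDIT has only one employee, so only FINANCE (cnt = 3) survives the filter — the same behavior the asker wants for single-entry departments.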


5 Comments

Works like a charm, but I had to remove `functions` before `count`. What is the significance of that?
@Debuggerr that is OK; count is a function from the functions package, which contains many built-in functions for processing DataFrames, and count is one of those applied after groupBy. You should really take a look at them: spark.apache.org/docs/latest/api/scala/…
@Debuggerr also please consider accepting/upvoting the answer if it worked and helped you
But when I tried executing your code, it threw an error that functions couldn't be found. I am working on Spark 1.6. Do I need to import it explicitly to make it work?
@Debuggerr for PySpark use from pyspark.sql.functions import *; for Scala/Java I always use functions.<function>, but you can use the equivalent import org.apache.spark.sql.functions;

Try something like this:

import org.apache.spark.sql.functions.count

DF.groupBy("x").agg(count("*").alias("cnt")).where($"cnt" >= 2)
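Since groupBy(...).count() already produces a column literally named count, the asker's original attempt can also be completed without the functions import (an untested sketch, assuming the same DataFrame and column names as in the question):

```scala
finalEmployeesDf.as("df1")
  .join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id")
  .groupBy($"dep_name")
  .count()                // adds a column named "count"
  .where($"count" >= 2)   // filter on that column, like HAVING
  .show()
```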

1 Comment

Simple and sweet!
