
I want to convert the query below into Spark DataFrame operations:

sqlContext.sql("SELECT d.dep_name,count(*) FROM employees e,department d WHERE e.dep_id = d.dep_id GROUP BY d.dep_name HAVING count(*) >= 2").show  

Output:

+---------+---+                                                                 
| dep_name|_c1|
+---------+---+
|  FINANCE|  3|
|    AUDIT|  5|
|MARKETING|  6|

I tried it using below query:

scala> finalEmployeesDf.as("df1").join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id").select($"dep_name").groupBy($"dep_name").count.show()
+---------+-----+                                                               
| dep_name|count|
+---------+-----+
|  FINANCE|    3|
|    AUDIT|    5|
|MARKETING|    6|
+---------+-----+  

I know this isn't correct because, in a case where a department has only a single entry, it would also be listed in these results, but I want rows displayed only when the count is at least 2, like the `HAVING count(*) >= 2` part of the SQL query. How can I achieve this? I tried googling, but found no help for this case.

  • There is no performance difference between the queries and dataframe operation, so why would you need to do this? Commented Aug 19, 2018 at 16:20
  • I am just learning from certification perspective @cricket_007 Commented Aug 25, 2018 at 5:31

2 Answers


You have the group and aggregate parts wrong. You need to select the relevant columns, group by them, and aggregate on the relevant ones. Here is untested code representing the correct approach:

import org.apache.spark.sql.functions.count  // required, or count() is not found

finalEmployeesDf.as("df1")
  .join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id")
  .select($"dep_name")
  .groupBy($"dep_name")
  .agg(count($"dep_name").as("cnt"))
  .filter($"cnt" >= 2)  // equivalent of HAVING count(*) >= 2
  .show()

A general suggestion: break chained API calls across several lines; it makes the code much easier to read and understand.
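The pattern can be sketched end to end with hypothetical sample data (the schema, values, and SparkSession setup below are assumptions for illustration, not from the question; on Spark 1.6 you would build the DataFrames from a SQLContext instead):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().master("local[*]").appName("having-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data mirroring the question's tables
val employees   = Seq((1, 10), (2, 10), (3, 10), (4, 20)).toDF("emp_id", "dep_id")
val departments = Seq((10, "FINANCE"), (20, "AUDIT")).toDF("dep_id", "dep_name")

employees.as("e")
  .join(departments.as("d"), $"e.dep_id" === $"d.dep_id")
  .groupBy($"d.dep_name")
  .agg(count("*").as("cnt"))
  .filter($"cnt" >= 2)   // plays the role of HAVING count(*) >= 2
  .show()
```

With this sample data, AUDIT has only one employee, so only FINANCE (cnt = 3) survives the filter — the same behavior the asker wants for single-entry departments.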


5 Comments

Works like a charm, but I had to remove `functions` before `count`. What is the significance of that?
@Debuggerr that is OK; count is a function from the functions package, which contains many built-in functions for processing DataFrames, and count is one of those applied after groupBy. You should really take a look at them: spark.apache.org/docs/latest/api/scala/…
@Debuggerr also please consider accepting/upvoting the answer if it worked and helped you
But when I tried executing your code, it threw an error that functions couldn't be found. I am working on Spark 1.6. Do I need to import it explicitly to make it work?
@Debuggerr for PySpark use from pyspark.sql.functions import *; for Scala/Java I always use functions.<function>, but you can use the equivalent import org.apache.spark.sql.functions;

Try something like this:

import org.apache.spark.sql.functions.count

DF.groupBy("x").agg(count("*").alias("cnt")).where($"cnt" >= 2)
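Since groupBy(...).count() already produces a column literally named count, the asker's original attempt can also be completed without the functions import (an untested sketch, assuming the same DataFrame and column names as in the question):

```scala
finalEmployeesDf.as("df1")
  .join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id")
  .groupBy($"dep_name")
  .count()                // adds a column named "count"
  .where($"count" >= 2)   // filter on that column, like HAVING
  .show()
```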

1 Comment

Simple and sweet!
