I want to convert the below query in spark dataframe:
sqlContext.sql("SELECT d.dep_name,count(*) FROM employees e,department d WHERE e.dep_id = d.dep_id GROUP BY d.dep_name HAVING count(*) >= 2").show
Output:
+---------+---+
| dep_name|_c1|
+---------+---+
| FINANCE| 3|
| AUDIT| 5|
|MARKETING| 6|
I tried it using below query:
scala> finalEmployeesDf.as("df1").join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id").select($"dep_name").groupBy($"dep_name").count.show()
+---------+-----+
| dep_name|count|
+---------+-----+
| FINANCE| 3|
| AUDIT| 5|
|MARKETING| 6|
+---------+-----+
I know that this isn't correct coz suppose we have a case where we have only single entry for department then it will be also listed in these results but I want results to be displayed only if counts are greater than 2. So how can I achieve this ??? I tried googling but of no help in this case.