Selecting with conditions from a dataframe using Spark Scala

Question

I'm new to Scala and am having a hard time working with a simple dataset in Spark. I want to review the following dataset ordered by eventType and crow, but I can't get it to sort in descending order. I also want to read out just one eventType at a time.

When I try

dataset.orderBy("eventType")

it works, but if I add '.desc' it fails:

scala> setB.orderBy("eventType").desc
<console>:32: error: value desc is not a member of org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
   setB.orderBy("eventType").desc

or

scala> dataset.orderBy("eventType".desc)
<console>:32: error: value desc is not a member of String
   dataset.orderBy("eventType".desc)

I am also trying to use filter, but nothing I try works either. Something like: dataset.filter("eventType"="agg%")

Sample dataset:

+----------------+------------------------------------------------------------------------------------+-----------------------------------+-------------+----------------+----+
|deadletterbucket|split                                                                               |eventType                          |clientVersion|dDeviceSurrogate|crow|
+----------------+------------------------------------------------------------------------------------+-----------------------------------+-------------+----------------+----+
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.3.0.108    |1               |3   |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.3.0.10     |1               |11  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.9.1.10     |3               |11  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.7.0.1      |3               |15  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.5.0.5      |6               |16  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.0.0.62     |7               |26  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.6.4.6      |9               |31  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_network_traffic|7.12.0.113   |1               |1   |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_network_traffic|6.3.2.15     |1               |2   |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_network_traffic|5.1.2.10     |1               |3   |
+----------------+------------------------------------------------------------------------------------+-----------------------------------+-------------+----------------+----+

Ideally, I am trying to get something like the following to work

dataset.orderBy("crow").desc.filter("eventType"="%app_launches").show(3,false)


|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.5.0.5      |6               |31  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.0.0.62     |7               |26  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.6.4.6      |9               |16  |

Sai Kiran KrishnaMurthy · Accepted Answer · 2019-07-05 21:16:37Z

3

You almost have the correct solution; only a few syntax details are missing. The correct syntax in Spark (Scala) is:


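 import org.apache.spark.sql.functions._
 // $"..." needs the session implicits; spark-shell imports them for you.
 // Assumes your SparkSession is named spark.
 import spark.implicits._

 dataset.orderBy(desc("crow")).filter($"eventType".contains("app_launches")).show(3, false)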

You can access the column using either $ or col. You can find more information here: https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/Column.html

I can also recommend going through this tutorial from the Spark homepage; it's quite helpful: https://spark.apache.org/docs/2.1.0/sql-programming-guide.html
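For anyone trying this outside spark-shell, here is a minimal sketch of the same query as a standalone object. The object name, the local SparkSession, and the reduced three-column sample are assumptions made for illustration; only the eventType, clientVersion and crow values are copied from the question. It also assumes crow is numeric. If crow is stored as a string, the sort would be lexicographic and you would probably want to cast it first.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc}

object OrderFilterSketch {
  // Assumption: a local session for experimenting; spark-shell already provides `spark`.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("order-filter-sketch")
    .getOrCreate()

  import spark.implicits._ // enables $"colName" and toDF on local collections

  // A few rows from the question's sample, reduced to three columns (crow assumed to be Int).
  val dataset: DataFrame = Seq(
    ("aggregate_event.app_launches", "4.3.0.108", 3),
    ("aggregate_event.app_launches", "4.6.4.6", 31),
    ("aggregate_event.app_launches", "4.0.0.62", 26),
    ("aggregate_event.app_network_traffic", "7.12.0.113", 1)
  ).toDF("eventType", "clientVersion", "crow")

  // Keep one event type, then sort by crow, highest first.
  val launches: DataFrame = dataset
    .filter($"eventType".contains("app_launches"))
    .orderBy(desc("crow"))

  // The same query written with col(...) instead of $"...".
  val launchesViaCol: DataFrame = dataset
    .filter(col("eventType").contains("app_launches"))
    .orderBy(col("crow").desc)

  def main(args: Array[String]): Unit = {
    launches.show(3, false) // at most 3 rows, without truncating long cell values
    spark.stop()
  }
}
```

Filtering before sorting, as in this sketch, reads a little more naturally because only the rows you keep need to be ordered; the result should be the same either way.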


Charlie Flowers · Accepted Answer · 2019-07-05 21:13:51Z

2

You are passing a String to identify the column you wish to order by. This is a convenience method, but if you want more control, such as a descending sort, you need to pass a Column argument instead. Spark offers several idiomatic ways of retrieving this object from the dataset:

dataset.orderBy($"crow".desc)...

dataset.orderBy(col("crow").desc)...

dataset.orderBy('crow.desc)...

dataset.orderBy(dataset("crow").desc)...

See https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@sort(sortExprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]
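As a rough sketch of how Column expressions cover both parts of the question, the example below sorts by eventType ascending and crow descending, and expresses the "%app_launches" wildcard with like. The object name, the local SparkSession, and the two-column sample data are assumptions for illustration, and crow is assumed to be numeric.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ColumnSortSketch {
  // Assumption: a local session for experimenting; spark-shell already provides `spark`.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("column-sort-sketch")
    .getOrCreate()

  // Hypothetical two-column sample in the shape of the question's data.
  val dataset: DataFrame = spark.createDataFrame(Seq(
    ("aggregate_event.app_launches", 11),
    ("aggregate_event.app_launches", 31),
    ("aggregate_event.app_network_traffic", 2),
    ("aggregate_event.app_network_traffic", 3)
  )).toDF("eventType", "crow")

  // eventType ascending, then crow descending within each event type.
  val sorted: DataFrame = dataset.orderBy(col("eventType").asc, col("crow").desc)

  // SQL-style wildcard match: % stands for any sequence of characters.
  val launches: DataFrame = dataset.filter(col("eventType").like("%app_launches"))

  // The same condition passed to filter as a SQL expression string.
  val launchesSql: DataFrame = dataset.filter("eventType LIKE '%app_launches'")

  def main(args: Array[String]): Unit = {
    sorted.show(false)
    launches.show(false)
    spark.stop()
  }
}
```

For a plain prefix or suffix match, Column also has startsWith and endsWith, which avoid the wildcard syntax altogether.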
