
I'm new to Scala and am having a hard time working with a simple dataset in Spark. I want to view the following dataset ordered by eventType and crow, but I can't get it to sort in descending order. I also want to show just one eventType at a time.

When I try

dataset.orderBy("eventType")

it works, but if I add '.desc' it fails:

scala> setB.orderBy("eventType").desc
<console>:32: error: value desc is not a member of org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
   setB.orderBy("eventType").desc

or

scala> dataset.orderBy("eventType".desc)
<console>:32: error: value desc is not a member of String
   dataset.orderBy("eventType".desc)

I am also trying to use filter, but nothing I try works either, for example: dataset.filter("eventType"="agg%")

Sample dataset:

+----------------+------------------------------------------------------------------------------------+-----------------------------------+-------------+----------------+----+
|deadletterbucket|split                                                                               |eventType                          |clientVersion|dDeviceSurrogate|crow|
+----------------+------------------------------------------------------------------------------------+-----------------------------------+-------------+----------------+----+
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.3.0.108    |1               |3   |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.3.0.10     |1               |11  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.9.1.10     |3               |11  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.7.0.1      |3               |15  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.5.0.5      |6               |16  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.0.0.62     |7               |26  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.6.4.6      |9               |31  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_network_traffic|7.12.0.113   |1               |1   |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_network_traffic|6.3.2.15     |1               |2   |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_network_traffic|5.1.2.10     |1               |3   |

Ideally, I am trying to get something like the following to work:

dataset.orderBy("crow").desc.filter("eventType"="%app_launches").show(3,false)


|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.6.4.6      |9               |31  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.0.0.62     |7               |26  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.5.0.5      |6               |16  |

2 Answers

3

You almost have the right solution; only the syntax is off. In Spark (Scala), pass a Column expression instead of chaining .desc onto a String or onto the Dataset:


import org.apache.spark.sql.functions._   // desc, col
import spark.implicits._                  // $"..." syntax; already in scope in spark-shell

dataset.orderBy(desc("crow")).filter($"eventType".contains("app_launches")).show(3, false)

You can refer to a column with either $ or col. The Column API documentation has more details: https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/Column.html
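Since the question also asks for wildcard matching ("agg%", "%app_launches"), here is a minimal sketch of how that could look with Column.like instead of contains. The sample rows, the names sample, launches and launchesSql, and the local SparkSession settings are assumptions made for illustration; only three of the original six columns are reproduced.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc}

// Assumption: a local session for experimenting; in spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("orderby-filter-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the question's dataset (eventType, clientVersion, crow only).
val sample = Seq(
  ("aggregate_event.app_launches", "4.6.4.6", 31),
  ("aggregate_event.app_launches", "4.0.0.62", 26),
  ("aggregate_event.app_launches", "5.5.0.5", 16),
  ("aggregate_event.app_network_traffic", "7.12.0.113", 1)
).toDF("eventType", "clientVersion", "crow")

// SQL-style wildcard match: % stands for any run of characters.
val launches = sample
  .filter(col("eventType").like("%app_launches"))
  .orderBy(desc("crow"))

launches.show(3, false)

// The same predicate written as a SQL expression string.
val launchesSql = sample
  .filter("eventType LIKE '%app_launches'")
  .orderBy(desc("crow"))
```

Filtering before sorting returns the same rows as sorting first, and it gives Spark fewer rows to sort.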

I also recommend working through the SQL programming guide on the Spark homepage; it's quite helpful: https://spark.apache.org/docs/2.1.0/sql-programming-guide.html

2

You are passing a String to identify the column you wish to order by. This is a convenience overload, but a String has no desc method, and neither does the Dataset that orderBy returns. To control the sort direction you need to pass a Column argument instead. Spark offers several idiomatic ways of retrieving this object from the dataset:

dataset.orderBy($"crow".desc)...

dataset.orderBy(col("crow").desc)...

dataset.orderBy('crow.desc)...

dataset.orderBy(dataset("crow").desc)...

See https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@sort(sortExprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]
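As a rough illustration of the Column-based forms above, the sketch below sorts a small, made-up DataFrame three ways and then combines two sort keys, which is what the question's "eventType and crow" ordering would need. The rows and the names events, byInterpolator, byColFunction, byApply and byTypeThenCrow are hypothetical. The 'crow symbol form is left out because symbol literals are deprecated in Scala 2.13.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimenting; in spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("column-sort-sketch").getOrCreate()
import spark.implicits._

// Hypothetical rows modelled on the question's sample data.
val events = Seq(
  ("aggregate_event.app_launches", 3),
  ("aggregate_event.app_launches", 11),
  ("aggregate_event.app_network_traffic", 31),
  ("aggregate_event.app_network_traffic", 2)
).toDF("eventType", "crow")

// Three spellings of the same Column-based sort expression.
val byInterpolator = events.orderBy($"crow".desc)
val byColFunction  = events.orderBy(col("crow").desc)
val byApply        = events.orderBy(events("crow").desc)

// Each sort key can carry its own direction: eventType ascending, then crow descending.
val byTypeThenCrow = events.orderBy(col("eventType").asc, col("crow").desc)
byTypeThenCrow.show(false)
```

Which spelling to use is mostly a matter of taste. col works without importing spark.implicits._, which can be convenient outside spark-shell.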
