
I'm trying to get the distinct values of a single column of a DataFrame (called df) into an Array that matches the data type of the column. This is what I've tried, but it does not work:

def distinctValues[T: ClassTag](column: String): Array[T] = {
  df.select(df(column)).distinct.map {
    case Row(s: T) => s
  }.collect
}

The method is inside an implicit class, so calling df.distinctValues("some_col") gives me:

scala.MatchError: [ABCD] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

Is there an elegant way to achieve what I want that is also type safe?

I'm on Spark 1.4.1.

  • Not sure why the pattern matching fails, but if you replace that map call with map(_.getAs[T](column)) you'll get what you want. Commented May 18, 2016 at 9:28

2 Answers


The problem is that you're using pattern matching instead of the getAs method:

import org.apache.spark.sql.DataFrame
import scala.reflect.ClassTag

implicit final class DataFrameOps(val df: DataFrame) {
  // the ClassTag context bound lets collect() build an Array[T] at runtime
  def distinctValues[T: ClassTag](column: String): Array[T] = {
    df.select(column).distinct().map(_.getAs[T](column)).collect()
  }
}

Usage:

val ageArray: Array[Int] = df.distinctValues("age")

or

val ageArray = df.distinctValues[Int]("age")

Comments

That gives the error: value class may not be a member of another class.
You should remove extends AnyVal then; it will then work even if that final class is nested in another class. I've edited the example so it is more general.
When I look at the limitations (docs.scala-lang.org/overviews/core/value-classes.html), I don't see why this is a problem, though. DataFrame is not a value class itself. There are no nested traits/objects/classes, only defs, no specialized type parameters, and it doesn't have a custom equals or hashCode. If I place it inside an object, it's a member of a top-level, statically accessible object. So what's the problem? I'm curious.
It should work (and works for me) if it's declared in an object. Have you checked whether this final class is accessible from other files, for example via an import?
Hmm, I just wrapped it in object SomeTest, copy-pasted it into my Spark shell (1.4.1), and it gives me the error message. I checked the source for my version and it does not seem to extend AnyVal, so I'm not sure I understand the problem. github.com/apache/spark/releases/tag/v1.4.1
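
A minimal sketch of the restriction under discussion, with illustrative names: a value class (extends AnyVal) may live at the top level or inside an object, but not inside another class - which is likely why the Spark shell rejects it even when you paste an object, since the shell wraps pasted code in generated classes.

import org.apache.spark.sql.DataFrame
import scala.reflect.ClassTag

object DataFrameSyntax {
  // fine: the value class is a member of a statically accessible object
  implicit final class DataFrameOps(val df: DataFrame) extends AnyVal {
    def distinctValues[T: ClassTag](column: String): Array[T] =
      df.select(column).distinct().map(_.getAs[T](column)).collect()
  }
}

class SomeWrapper {
  // does not compile: "value class may not be a member of another class"
  // implicit final class BadOps(val df: DataFrame) extends AnyVal { }
}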

Since 1.4.0, Spark has had a dropDuplicates method that implements distinct over a sequence of columns (or over all columns, if none is specified):

import scala.reflect.ClassTag

// drop duplicates considering the specified columns
val distinctDf = df.select($"column").dropDuplicates(Seq("column"))

// this works too, since the frame has a single column after the select
val distinctDf2 = df.select($"column").dropDuplicates()

// collect into a typed array (the ClassTag lets collect() build the Array[T])
def getValues[T: ClassTag](df: DataFrame, columnName: String): Array[T] = {
  df.map(_.getAs[T](columnName)).collect()
}

getValues[String](distinctDf, "column")
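
For completeness, a short end-to-end sketch under the same assumptions (the $"..." interpolator comes from sqlContext.implicits._ in Spark 1.4, and "column" is an illustrative column name):

import sqlContext.implicits._

val distinctValues: Array[String] =
  getValues[String](df.select($"column").dropDuplicates(), "column")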

Comments

Which is just an alias for distinct, so it does exactly the same thing. Moreover, I'm looking for a way to handle different types rather than hard-coding String. So my question is really about going from a DataFrame of a single column to an Array[T], with T the type of the column in the DataFrame.
See my updated answer. Yes, distinct is an alias for the parameterless version of dropDuplicates; as mentioned, dropDuplicates also has another, more generic signature.
Thanks! Is there no way the type can be inferred? I mean: a DataFrame is a typed object, so that info should be available.
Unlikely, because DataFrame is not strongly typed - it does not have type parameters. That is the reason why Databricks recently introduced Dataset. So if you can upgrade your Spark instance to 1.6, skip DataFrames and try Datasets instead.
That's not accurate - a DataFrame can be matched using pattern matching (thanks to Row.unapplySeq and the fact that DataFrame.map actually calls RDD[Row].map); it's just that a generic type (T in this case) can't be used in a type pattern this way, because it is erased at runtime. A version of the OP's matching expression using a simpler generic container (e.g. List) doesn't work either. The workarounds are valid, but the root cause is different.
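
To see that root cause without Spark, here is a minimal sketch of the erasure problem (names are illustrative): the type parameter T is erased at runtime, so a pattern like case s: T is unchecked and matches any value (unless a ClassTag is in scope to check it).

// compiles, but with an "abstract type pattern T is unchecked" warning
def firstAs[T](xs: List[Any]): T = xs match {
  case (s: T) :: _ => s
}

// the match "succeeds" even though the element is not an Int; the failure
// only surfaces later, as a ClassCastException at the use site
val n: Int = firstAs[Int](List("not an int"))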
