Using Spark 1.5 and Scala 2.10.6
I'm trying to filter a DataFrame on a field "tags" that is an array of strings, looking for all rows that have the tag 'private'.
val report = df.select("*")
.where(df("tags").contains("private"))
getting:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'Contains(tags, private)' due to data type mismatch: argument 1 requires string type, however, 'tags' is of array type.;
Is the filter method better suited?
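Swapping in filter doesn't seem to change anything; as far as I can tell it is just an alias for where on DataFrame, so the same expression fails with the same AnalysisException:

val report = df.filter(df("tags").contains("private"))   // same data type mismatch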
UPDATED:
The data is coming from the Cassandra adapter, but here is a minimal example that shows what I'm trying to do and also produces the above error:
def testData(sc: SparkContext): DataFrame = {
  val stringRDD = sc.parallelize(Seq(
    """{ "name": "ed",
         "tags": ["red", "private"]
       }""",
    """{ "name": "fred",
         "tags": ["public", "blue"]
       }"""))
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._
  sqlContext.read.json(stringRDD)
}
def run(sc: SparkContext) {
  val df1 = testData(sc)
  df1.show()
  val report = df1.select("*")
    .where(df1("tags").contains("private"))
  report.show()
}
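For completeness, a minimal sketch of wiring this repro into a runnable app (the object name, app name, and master are just placeholders; testData and run are assumed to sit in the same object):

import org.apache.spark.{SparkConf, SparkContext}

object TagFilterRepro {
  def main(args: Array[String]): Unit = {
    // small local context, just enough to reproduce the error
    val conf = new SparkConf().setAppName("tag-filter-repro").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try run(sc) finally sc.stop()
  }
}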
UPDATED: the tags array can be any length and the 'private' tag can be in any position
UPDATED: one solution that works is a UDF:
import org.apache.spark.sql.functions.udf
import scala.collection.mutable

val filterPriv = udf { (tags: mutable.WrappedArray[String]) => tags.contains("private") }
val report = df1.filter(filterPriv(df1("tags")))
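The same predicate can also be registered for use from SQL, if that reads better; a minimal sketch, assuming the temp table name "docs" and reusing df1 from the example above:

// register the predicate as a SQL UDF; a plain Seq[String] parameter is enough here
df1.sqlContext.udf.register("has_private", (tags: Seq[String]) => tags.contains("private"))
df1.registerTempTable("docs")
val report = df1.sqlContext.sql("SELECT * FROM docs WHERE has_private(tags)")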
The scaladoc for Column.contains says only "Contains the other element", which is not very enlightening. Looking at the source, Column.contains constructs an instance of org.apache.spark.sql.catalyst.expressions.Contains, which says "A function that returns true if the string left contains the string right". So it seems that df1("tags").contains cannot do what we want it to do in this case. I don't know what alternative to suggest. There is an ArrayContains in the same expressions package, but Column doesn't seem to make use of it.

So for now the UDF is the workaround:

val filterPriv = udf { (tags: mutable.WrappedArray[String]) => tags.contains("private") }
val report = df1.filter(filterPriv(df1("tags")))

Still looking for something nicer, but at least I'm not blocked. thx!
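One alternative I have not verified on this exact setup: the functions object appears to have gained an array_contains helper around Spark 1.5, and if it is available in your build it would avoid the UDF entirely:

import org.apache.spark.sql.functions.array_contains

val report = df1.filter(array_contains(df1("tags"), "private"))
report.show()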