
Using Spark 1.5 and Scala 2.10.6

I'm trying to filter a DataFrame on a field "tags" that is an array of strings, looking for all rows that have the tag 'private'.

val report = df.select("*")
  .where(df("tags").contains("private"))

I'm getting:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'Contains(tags, private)' due to data type mismatch: argument 1 requires string type, however, 'tags' is of array type.;

Is the filter method better suited?

UPDATED:

The data is coming from the Cassandra adapter, but a minimal example that shows what I'm trying to do, and that also produces the above error, is:

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.DataFrame

  // Builds a two-row DataFrame with a string column and an array-of-strings column.
  def testData(sc: SparkContext): DataFrame = {
    val stringRDD = sc.parallelize(Seq("""
      { "name": "ed",
        "tags": ["red", "private"]
      }""",
      """{ "name": "fred",
        "tags": ["public", "blue"]
      }""")
    )
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.read.json(stringRDD)
  }

  def run(sc: SparkContext) {
    val df1 = testData(sc)
    df1.show()
    val report = df1.select("*")
      .where(df1("tags").contains("private")) // throws the AnalysisException above
    report.show()
  }

UPDATED: the tags array can be any length and the 'private' tag can be in any position.

UPDATED: one solution that works is a UDF:

import scala.collection.mutable
import org.apache.spark.sql.functions.udf

// The array column arrives in the UDF as a WrappedArray.
val filterPriv = udf { (tags: mutable.WrappedArray[String]) => tags.contains("private") }
val report = df1.filter(filterPriv(df1("tags")))
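
A follow-up note on this: if tags can be null for some rows, the UDF above will throw a NullPointerException. A null-safe sketch of the same idea (Spark will also bind an array column to a plain Seq[String] parameter):

import org.apache.spark.sql.functions.udf

// Null-safe variant: Option(...) turns a null array into None instead of throwing.
val filterPrivSafe = udf { (tags: Seq[String]) => Option(tags).exists(_.contains("private")) }
val reportSafe = df1.filter(filterPrivSafe(df1("tags")))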
  • Post a sample of your data and how you are creating the DataFrame. Commented Jan 17, 2016 at 0:19
  • One option is to build a UDF. Commented Jan 17, 2016 at 3:24
  • Well, after looking at the source code (since the scaladoc for Column.contains says only "Contains the other element", which is not very enlightening), I see that Column.contains constructs an instance of org.apache.spark.sql.catalyst.expressions.Contains, which says "A function that returns true if the string left contains the string right". So it seems that df1("tags").contains cannot do what we want it to do in this case. I don't know what alternative to suggest. There is an ArrayContains also in ...expressions, but Column doesn't seem to make use of it; see the sketch after these comments. Commented Jan 17, 2016 at 3:26
  • Indeed, after changing the data to just strings instead of an array of strings, I find that the query succeeds. Commented Jan 17, 2016 at 3:33
  • @DavidMaust, I got a UDF to work: val filterPriv = udf {(tags: mutable.WrappedArray[String]) => tags.contains("private")}; val report = df1.filter(filterPriv(df1("tags"))) still looking for something nicer but at least I'm not blocked. thx! Commented Jan 17, 2016 at 15:59
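
To make the observation in the comments concrete: even though Column does not expose ArrayContains, Spark 1.5 registers that expression under the name array_contains in its SQL function registry, so it can be reached through a SQL expression string. A sketch:

import org.apache.spark.sql.functions.expr

// array_contains is resolved by the SQL parser even though Column has no method for it.
val report = df1.where(expr("array_contains(tags, 'private')"))
// filter also accepts a raw SQL expression string directly:
val report2 = df1.filter("array_contains(tags, 'private')")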

2 Answers


I think if you use where(array_contains(...)) it will work. Here's my result:

scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext

scala> import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame

scala> import org.apache.spark.sql.functions.array_contains
import org.apache.spark.sql.functions.array_contains

scala> def testData (sc: SparkContext): DataFrame = {
     |     val stringRDD = sc.parallelize(Seq
     |      ("""{ "name": "ned", "tags": ["blue", "big", "private"] }""",
     |       """{ "name": "albert", "tags": ["private", "lumpy"] }""",
     |       """{ "name": "zed", "tags": ["big", "private", "square"] }""",
     |       """{ "name": "jed", "tags": ["green", "small", "round"] }""",
     |       """{ "name": "ed", "tags": ["red", "private"] }""",
     |       """{ "name": "fred", "tags": ["public", "blue"] }"""))
     |     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
     |     import sqlContext.implicits._
     |     sqlContext.read.json(stringRDD)
     |   }
testData: (sc: org.apache.spark.SparkContext)org.apache.spark.sql.DataFrame

scala> val df = testData (sc)
df: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]

scala> val report = df.select ("*").where (array_contains (df("tags"), "private"))
report: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]

scala> report.show
+------+--------------------+
|  name|                tags|
+------+--------------------+
|   ned|[blue, big, private]|
|albert|    [private, lumpy]|
|   zed|[big, private, sq...|
|    ed|      [red, private]|
+------+--------------------+

Note that it works if you write where(array_contains(df("tags"), "private")), but if you write where(df("tags").array_contains("private")) (more directly analogous to what you wrote originally) it fails with array_contains is not a member of org.apache.spark.sql.Column. Looking at the source code for Column, I see there's some handling for contains (constructing a Contains instance) but nothing for array_contains. Maybe that's an oversight.
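
If the fluent style of the original attempt matters, one can recover it with a small enrichment. A sketch (RichColumn and containsElement are names made up here, not part of Spark's API):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.array_contains

// Hypothetical enrichment so df("tags").containsElement("private") reads like
// the original attempt; it simply delegates to the array_contains function.
implicit class RichColumn(underlying: Column) {
  def containsElement(value: Any): Column = array_contains(underlying, value)
}

val report = df.where(df("tags").containsElement("private"))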


2 Comments

.select("*") is not needed => df.where(...) ...
Need to import org.apache.spark.sql.functions.array_contains before one can use this method.
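
Putting those two comments together, a minimal form of this approach would look like:

import org.apache.spark.sql.functions.array_contains

val report = df.where(array_contains(df("tags"), "private"))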

You can use an ordinal to refer to an element of the JSON array, e.g. in your case df("tags")(0). Here is a working sample:

scala> val stringRDD = sc.parallelize(Seq("""
     |       { "name": "ed",
     |         "tags": ["private"]
     |       }""",
     |       """{ "name": "fred",
     |         "tags": ["public"]
     |       }""")
     |     )
stringRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[87] at parallelize at <console>:22

scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> sqlContext.read.json(stringRDD)
res28: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]

scala> val df=sqlContext.read.json(stringRDD)
df: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]

scala> df.columns
res29: Array[String] = Array(name, tags)

scala> df.dtypes
res30: Array[(String, String)] = Array((name,StringType), (tags,ArrayType(StringType,true)))

scala> val report = df.select("*").where(df("tags")(0).contains("private"))
report: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]

scala> report.show
+----+-------------+
|name|         tags|
+----+-------------+
|  ed|List(private)|
+----+-------------+

7 Comments

thanks. works if the position is fixed, but it isn't. I should have made the test data a little more complex: there can be any number of tags in the array, and the position is arbitrary.
@navicore then your code should work. Check my update.
interesting, I'm missing something; this looks like exactly what I was doing, yet I get the error. Double-checking Spark versions now...
@navicore this is on 1.5.4
thx. I must be crossing hands somewhere. I tried 1.5.1 and 1.6, and val report = df.select("*").where(df("tags").contains("private")) gives me the error in the original post. Digging...
