0

I have:

 +-----------------------+-------+------------------------------------+
 |cities                 |name   |schools                             |
 +-----------------------+-------+------------------------------------+
 |[palo alto, menlo park]|Michael|[[stanford, 2010], [berkeley, 2012]]|
 |[santa cruz]           |Andy   |[[ucsb, 2011]]                      |
 |[portland]             |Justin |[[berkeley, 2014]]                  |
 +-----------------------+-------+------------------------------------+

I get this no sweat:

 val res = df.select ("*").where (array_contains (df("schools.sname"), "berkeley")).show(false)

But without wanting to explode or using an UDF, I in the same way or similar as above, how can I do something like:

 return all rows where at least 1 schools.sname starts with "b"  ?

e.g.:

 val res = df.select ("*").where (startsWith (df("schools.sname"), "b")).show(false)

This is wrong of course, just to demonstrate the point. But how can I do something like this without exploding or UDF-usage returning true/false or whatever and filtering in general without UDF usage? May be it is not possible. I cannot find any such examples. Or is it expr I need?

Answers gotten which show how certain things have a certain approach as some capabilities do not exist in SCALA. I read an article that points out to new array features to be implemented after this, so proves a point.

3
  • @Leo C I am wondering if you could shed some light on this by any chance? Commented Sep 27, 2018 at 11:38
  • how did you name it as "schools.sname" when creating the DF? Commented Sep 27, 2018 at 14:27
  • @stack0114106 Inferred via spark.read.json - that is all correct Commented Sep 27, 2018 at 16:44

2 Answers 2

1

How about this.

scala> val df = Seq ( ( Array("palo alto", "menlo park"), "Michael", Array(("stanford", 2010), ("berkeley", 2012))),
     |     (Array(("santa cruz")),"Andy",Array(("ucsb", 2011))),
     |       (Array(("portland")),"Justin",Array(("berkeley", 2014)))
     |     ).toDF("cities","name","schools")
df: org.apache.spark.sql.DataFrame = [cities: array<string>, name: string ... 1 more field]

scala> val df2 = df.select ("*").withColumn("sch1",df("schools._1"))
df2: org.apache.spark.sql.DataFrame = [cities: array<string>, name: string ... 2 more fields]

scala> val df3=df2.select("*").withColumn("sch2",concat_ws(",",df2("sch1")))
df3: org.apache.spark.sql.DataFrame = [cities: array<string>, name: string ... 3 more fields]

scala> df3.select("*").where( df3("sch2") rlike "^b|,b" ).show(false)
+-----------------------+-------+------------------------------------+--------------------+-----------------+
|cities                 |name   |schools                             |sch1                |sch2             |
+-----------------------+-------+------------------------------------+--------------------+-----------------+
|[palo alto, menlo park]|Michael|[[stanford, 2010], [berkeley, 2012]]|[stanford, berkeley]|stanford,berkeley|
|[portland]             |Justin |[[berkeley, 2014]]                  |[berkeley]          |berkeley         |
+-----------------------+-------+------------------------------------+--------------------+-----------------+

in one more step, you can drop the unwanted columns.

Sign up to request clarification or add additional context in comments.

1 Comment

OK, the reality is that with the functions I was thinking about, it was not possible. I answered a question with rlike yesterday BTW. Cheers.
1

I'm not sure whether this qualifies as a UDF or not, but you could define a new filter function. If using a Dataset[Student] where:

case class School(sname: String, year: Int)
case class Student(cities: Seq[String], name: String, schools: Seq[School])

Then you can simply do the following:

students
    .filter(
        r => r.schools.filter(_.sname.startsWith("b")).size > 0)

However, if you are just using a DataFrame then:

import org.apache.spark.sql.Row

students.toDF
    .filter(
        r => r.getAs[Seq[Row]]("schools").filter(_.getAs[String]("name")
                                         .startsWith("b")).size > 0)

Both of which will result in:

+-----------------------+-------+------------------------------------+
|cities                 |name   |schools                             |
+-----------------------+-------+------------------------------------+
|[palo alto, menlo park]|Michael|[[stanford, 2010], [berkeley, 2012]]|
|[portland]             |Justin |[[berkeley, 2014]]                  |
+-----------------------+-------+------------------------------------+

1 Comment

Welcome. I was looking for the select style with functions as opposed to (r => r etc. But OK, this seems to prove at least that that approach is most likely not possible and I was not missing something in that regard. I will wait and see if @Leo C comes back, otherwise the marks will go to you for the answer as well as the upvote now. Thx

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.