Filtering non-exploded structure in SPARK Data Frame with Scala

Question

I have:

 +-----------------------+-------+------------------------------------+
 |cities                 |name   |schools                             |
 +-----------------------+-------+------------------------------------+
 |[palo alto, menlo park]|Michael|[[stanford, 2010], [berkeley, 2012]]|
 |[santa cruz]           |Andy   |[[ucsb, 2011]]                      |
 |[portland]             |Justin |[[berkeley, 2014]]                  |
 +-----------------------+-------+------------------------------------+

I get this no sweat:

 val res = df.select ("*").where (array_contains (df("schools.sname"), "berkeley")).show(false)

But without wanting to explode or using an UDF, I in the same way or similar as above, how can I do something like:

 return all rows where at least 1 schools.sname starts with "b"  ?

e.g.:

 val res = df.select ("*").where (startsWith (df("schools.sname"), "b")).show(false)

This is wrong of course, just to demonstrate the point. But how can I do something like this without exploding or UDF-usage returning true/false or whatever and filtering in general without UDF usage? May be it is not possible. I cannot find any such examples. Or is it expr I need?

Answers gotten which show how certain things have a certain approach as some capabilities do not exist in SCALA. I read an article that points out to new array features to be implemented after this, so proves a point.

@Leo C I am wondering if you could shed some light on this by any chance? — Ged
– Ged, Commented Sep 27, 2018 at 11:38
how did you name it as "schools.sname" when creating the DF? — stack0114106
– stack0114106, Commented Sep 27, 2018 at 14:27
@stack0114106 Inferred via spark.read.json - that is all correct — Ged
– Ged, Commented Sep 27, 2018 at 16:44

stack0114106 · Accepted Answer · 2018-09-27 18:31:31Z

1

How about this.

scala> val df = Seq ( ( Array("palo alto", "menlo park"), "Michael", Array(("stanford", 2010), ("berkeley", 2012))),
     |     (Array(("santa cruz")),"Andy",Array(("ucsb", 2011))),
     |       (Array(("portland")),"Justin",Array(("berkeley", 2014)))
     |     ).toDF("cities","name","schools")
df: org.apache.spark.sql.DataFrame = [cities: array<string>, name: string ... 1 more field]

scala> val df2 = df.select ("*").withColumn("sch1",df("schools._1"))
df2: org.apache.spark.sql.DataFrame = [cities: array<string>, name: string ... 2 more fields]

scala> val df3=df2.select("*").withColumn("sch2",concat_ws(",",df2("sch1")))
df3: org.apache.spark.sql.DataFrame = [cities: array<string>, name: string ... 3 more fields]

scala> df3.select("*").where( df3("sch2") rlike "^b|,b" ).show(false)
+-----------------------+-------+------------------------------------+--------------------+-----------------+
|cities                 |name   |schools                             |sch1                |sch2             |
+-----------------------+-------+------------------------------------+--------------------+-----------------+
|[palo alto, menlo park]|Michael|[[stanford, 2010], [berkeley, 2012]]|[stanford, berkeley]|stanford,berkeley|
|[portland]             |Justin |[[berkeley, 2014]]                  |[berkeley]          |berkeley         |
+-----------------------+-------+------------------------------------+--------------------+-----------------+

in one more step, you can drop the unwanted columns.

answered Sep 27, 2018 at 18:31

stack0114106

8,8934 gold badges16 silver badges40 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Ged Over a year ago

OK, the reality is that with the functions I was thinking about, it was not possible. I answered a question with rlike yesterday BTW. Cheers.

Leonard Kerr · Accepted Answer · 2018-09-27 17:24:26Z

1

I'm not sure whether this qualifies as a UDF or not, but you could define a new filter function. If using a Dataset[Student] where:

case class School(sname: String, year: Int)
case class Student(cities: Seq[String], name: String, schools: Seq[School])

Then you can simply do the following:

students
    .filter(
        r => r.schools.filter(_.sname.startsWith("b")).size > 0)

However, if you are just using a DataFrame then:

import org.apache.spark.sql.Row

students.toDF
    .filter(
        r => r.getAs[Seq[Row]]("schools").filter(_.getAs[String]("name")
                                         .startsWith("b")).size > 0)

Both of which will result in:

+-----------------------+-------+------------------------------------+
|cities                 |name   |schools                             |
+-----------------------+-------+------------------------------------+
|[palo alto, menlo park]|Michael|[[stanford, 2010], [berkeley, 2012]]|
|[portland]             |Justin |[[berkeley, 2014]]                  |
+-----------------------+-------+------------------------------------+

edited Sep 27, 2018 at 17:24

answered Sep 27, 2018 at 17:08

Leonard Kerr

114 bronze badges

1 Comment

Ged Over a year ago

Welcome. I was looking for the select style with functions as opposed to (r => r etc. But OK, this seems to prove at least that that approach is most likely not possible and I was not missing something in that regard. I will wait and see if @Leo C comes back, otherwise the marks will go to you for the answer as well as the upvote now. Thx

Collectives™ on Stack Overflow

Filtering non-exploded structure in SPARK Data Frame with Scala

2 Answers 2

1 Comment

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related