1

Given the following structure:

root

     |-- lvl_1: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- l1_Attribute: String
     ....
     |    |    |-- lvl_2: array
     |    |    |-- element: struct
     |    |    |    |-- l2_attribute: String
     ...

I want to filter by the l2_attribute inside the lvl_2 array, which is nested inside the lvl_1 array. Can this be done without exploding the lvl_1 array first?

I can filter the lvl_1 array without exploding it:

rdd.select(<columns>,
      expr("filter(lvl_1, lvl_1_struct -> upper(lvl_1_struct.l1_Attribute) == 'foo')")

But I can't figure out how to do this for the nested lvl_2 array.

2
  • 2
    Can you share an example input dataset and expected output please? Commented Jun 26, 2019 at 18:59
  • unless sample dataset example which is reproducable its hard to give solid answer Commented Jun 26, 2019 at 19:21

1 Answer 1

3

See if this helps. solution is to flatten the inner arrays and use org.apache.spark.sql.functions.array_contains function to filter.

If you are using spark 2.4+ you may use higher order function org.apache.spark.sql.functions.flatten instead of UDF as shown in the solution.(spark 2.3)

val df = Seq(
  Seq(
    ("a", Seq(2, 4, 6, 8, 10, 12)),
    ("b", Seq(3, 6, 9, 12)),
    ("c", Seq(1, 2, 3, 4))
  ),
  Seq(
    ("e", Seq(4, 8, 12)),
    ("f", Seq(1, 3, 6)),
    ("g", Seq(3, 4, 5, 6))
  )
).toDF("lvl_1")

df: org.apache.spark.sql.DataFrame = [lvl_1: array<struct<_1:string,_2:array<int>>>]

scala> df.show(false)
+------------------------------------------------------------------+
|lvl_1                                                             |
+------------------------------------------------------------------+
|[[a, [2, 4, 6, 8, 10, 12]], [b, [3, 6, 9, 12]], [c, [1, 2, 3, 4]]]|
|[[e, [4, 8, 12]], [f, [1, 3, 6]], [g, [3, 4, 5, 6]]]              |
+------------------------------------------------------------------+


scala> def flattenSeqOfSeq[S](x:Seq[Seq[S]]): Seq[S] = { x.flatten }
flattenSeqOfSeq: [S](x: Seq[Seq[S]])Seq[S]

scala> val myUdf = udf { flattenSeqOfSeq[Int] _}
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(IntegerType,false),Some(List(ArrayType(ArrayType(IntegerType,false),true))))

scala> df.withColumn("flattnedinnerarrays", myUdf($"lvl_1".apply("_2")))
res66: org.apache.spark.sql.DataFrame = [lvl_1: array<struct<_1:string,_2:array<int>>>, flattnedinnerarrays: array<int>]

scala> res66.show(false)
+------------------------------------------------------------------+---------------------------------------------+
|lvl_1                                                             |flattnedinnerarrays                          |
+------------------------------------------------------------------+---------------------------------------------+
|[[a, [2, 4, 6, 8, 10, 12]], [b, [3, 6, 9, 12]], [c, [1, 2, 3, 4]]]|[2, 4, 6, 8, 10, 12, 3, 6, 9, 12, 1, 2, 3, 4]|
|[[e, [4, 8, 12]], [f, [1, 3, 6]], [g, [3, 4, 5, 6]]]              |[4, 8, 12, 1, 3, 6, 3, 4, 5, 6]              |
+------------------------------------------------------------------+---------------------------------------------+

scala> res66.filter(array_contains($"flattnedinnerarrays", 10)).show(false)
+------------------------------------------------------------------+---------------------------------------------+
|lvl_1                                                             |flattnedinnerarrays                          |
+------------------------------------------------------------------+---------------------------------------------+
|[[a, [2, 4, 6, 8, 10, 12]], [b, [3, 6, 9, 12]], [c, [1, 2, 3, 4]]]|[2, 4, 6, 8, 10, 12, 3, 6, 9, 12, 1, 2, 3, 4]|
+------------------------------------------------------------------+---------------------------------------------+

scala> res66.filter(array_contains($"flattnedinnerarrays", 3)).show(false)
+------------------------------------------------------------------+---------------------------------------------+
|lvl_1                                                             |flattnedinnerarrays                          |
+------------------------------------------------------------------+---------------------------------------------+
|[[a, [2, 4, 6, 8, 10, 12]], [b, [3, 6, 9, 12]], [c, [1, 2, 3, 4]]]|[2, 4, 6, 8, 10, 12, 3, 6, 9, 12, 1, 2, 3, 4]|
|[[e, [4, 8, 12]], [f, [1, 3, 6]], [g, [3, 4, 5, 6]]]              |[4, 8, 12, 1, 3, 6, 3, 4, 5, 6]              |
+------------------------------------------------------------------+---------------------------------------------+
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks, I'll give it a go.
@Andrew If it worked can you please accept the answer?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.