I want to first filter only the rows whose nested column contains Max, and then explode only those rows.

My Avro Record:

{
  "name": "Parent",
  "type": "record",
  "fields": [
    {"name": "firstname", "type": "string"},
    {
      "name": "children",
      "type": {
        "type": "array",
        "items": {
          "name": "child",
          "type": "record",
          "fields": [
            {"name": "name", "type": "string"},
            {"name": "price", "type": ["long", "null"]}
          ]
        }
      }
    }
  ]
}

I am using the Spark SQL context to query the DataFrame that is read in. So if the input is:

Row no   firstname   children
1        John        [[Max, 20], [Pg, 22]]
2        Bru         [[huna, 10], [aman, 12]]

I first query by exploding the inner table, so the nested column is split into one row per child:

Row no   firstname   children.name   children.price
1        John        Max             20
1        John        Pg              22
2        Bru         huna            10
2        Bru         aman            12
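The explode step above can be written as a query like the following (a sketch: `parent` is an assumed temp-table name, and the string would be passed to the SQL context's sql method):

```scala
// Sketch of the explode step that produces the flattened table above.
// The query string is shown standalone; in Spark it would be run with
// sqlc.sql(explodeQuery) after registering the DataFrame as "parent".
val explodeQuery =
  """SELECT firstname, child.name, child.price
    |FROM parent
    |LATERAL VIEW explode(children) childTable AS child""".stripMargin
```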

q1) I want to first filter only the rows which contain Max, and then explode only those rows. In the current situation, if one column holds a million values, Spark first generates a million rows and only then checks whether Max is present.

q2) Likewise, I want to first filter only the rows which have a price > 12, and then explode only those rows. Currently, Spark first generates all the rows and only then checks whether price > 12 holds.

Something like this:

val results = sqlc.sql("SELECT firstname, child.name FROM parent LATERAL VIEW explode(children) childTable AS child WHERE child.price > 12")

  • Filter on children name containing max to create a new DataFrame and then explode. Have you tried that? Commented May 23, 2016 at 5:13
  • The following query doesn't work: val results = sqlc.sql("SELECT firstname, children.name FROM parent where children.name = 'Max'") Commented May 23, 2016 at 5:17
  • I said contains, not equals. You can find all the SQL functions you can use in the Spark Scala doc. Commented May 23, 2016 at 5:19
  • stackoverflow.com/a/35628252/1560062 Commented May 23, 2016 at 5:22
  • Also please review your other questions! They are hanging unresolved for now. Commented May 23, 2016 at 5:23

1 Answer


Here are the answers to the two questions.

ans1) If you want to check whether a string exists in an array of nested records:

var results = sqlc.sql("SELECT firstname, children.name FROM parent WHERE array_contains(children['name'], 'pg')")
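Note that array_contains tests exact, case-sensitive element equality, so 'pg' will not match 'Pg'. A plain-Scala model of the check, for illustration (`namesContain` is a hypothetical helper, not a Spark API):

```scala
// Plain-Scala model of array_contains(children['name'], target):
// true only when some element equals the search string exactly.
def namesContain(names: Seq[String], target: String): Boolean =
  names.contains(target)
```

So for the sample data above, searching for "Pg" matches John's row, while "pg" matches nothing.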

ans2) If you want to apply a condition across an array of nested records, use a UDF. Since price is declared as long in the schema, Spark passes the column to the UDF as a WrappedArray[Long] (note the import of scala.collection.mutable):

import scala.collection.mutable

sqlc.udf.register("myPriceFilter", (price: mutable.WrappedArray[Long]) => price exists (_ > 12))

var results = sqlc.sql("SELECT firstname, explode(children.price) FROM parent WHERE myPriceFilter(children['price'])")
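The predicate behind the UDF can be sanity-checked in plain Scala before registering it (a sketch; `anyPriceAbove` is a hypothetical name, and the threshold 12 is taken from q2):

```scala
// Predicate behind the UDF: true if any price in the child array
// exceeds the threshold. Spark hands array<bigint> columns to Scala
// UDFs as a WrappedArray[Long], which is a Seq[Long].
def anyPriceAbove(prices: Seq[Long], threshold: Long): Boolean =
  prices.exists(_ > threshold)
```

With this predicate in the WHERE clause, rows whose children all have price <= 12 are dropped before the explode runs, which is the pruning asked for in q2.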