I want to first filter only the rows whose nested column contains Max, and then explode only those rows.

My Avro Record:

{
  "name": "Parent",
  "type": "record",
  "fields": [
    {"name": "firstname", "type": "string"},
    {
      "name": "children",
      "type": {
        "type": "array",
        "items": {
          "name": "child",
          "type": "record",
          "fields": [
            {"name": "name", "type": "string"},
            {"name": "price", "type": ["long", "null"]}
          ]
        }
      }
    }
  ]
}

I am using the Spark SQL context to query the DataFrame that is read in. So if the input is:

Row no   firstname   children
1        John        [[Max, 20], [Pg, 22]]
2        Bru         [[huna, 10], [aman, 12]]

I first query by exploding the inner table, so the nested column is split into one row per child:

Row no   firstname   children.name   children.price
1        John        Max             20
1        John        Pg              22
2        Bru         huna            10
2        Bru         aman            12
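The explode step above can be written as a query like the following (a sketch: `parent` is an assumed temp-table name, and the string would be passed to the SQL context's sql method):

```scala
// Sketch of the explode step that produces the flattened table above.
// The query string is shown standalone; in Spark it would be run with
// sqlc.sql(explodeQuery) after registering the DataFrame as "parent".
val explodeQuery =
  """SELECT firstname, child.name, child.price
    |FROM parent
    |LATERAL VIEW explode(children) childTable AS child""".stripMargin
```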

q1) I want to first filter only the rows which contain Max, and then explode only those rows. In the current situation, if one column holds a million values, Spark first generates a million rows and only then checks whether Max is present.

q2) Likewise, I want to first filter only the rows which have a price > 12, and then explode only those rows. Currently, Spark first generates all the rows and only then checks whether price > 12 holds.

Something like this:

val results = sqlc.sql("SELECT firstname, child.name FROM parent LATERAL VIEW explode(children) childTable AS child WHERE child.price > 12")

  • Filter on children name containing max to create a new DataFrame and then explode. Have you tried that? Commented May 23, 2016 at 5:13
  • The following query doesn't work: val results = sqlc.sql("SELECT firstname, children.name FROM parent where children.name = 'Max'") Commented May 23, 2016 at 5:17
  • I said contains, not equals. You can find all the SQL functions you can use in the Spark Scala doc. Commented May 23, 2016 at 5:19
  • stackoverflow.com/a/35628252/1560062 Commented May 23, 2016 at 5:22
  • Also please review your other questions! They are hanging unresolved for now. Commented May 23, 2016 at 5:23

1 Answer


Here are the answers to the two questions.

ans1) If you want to check whether a string exists in an array of nested records:

var results = sqlc.sql("SELECT firstname, children.name FROM parent WHERE array_contains(children['name'], 'pg')")
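Note that array_contains tests exact, case-sensitive element equality, so 'pg' will not match 'Pg'. A plain-Scala model of the check, for illustration (`namesContain` is a hypothetical helper, not a Spark API):

```scala
// Plain-Scala model of array_contains(children['name'], target):
// true only when some element equals the search string exactly.
def namesContain(names: Seq[String], target: String): Boolean =
  names.contains(target)
```

So for the sample data above, searching for "Pg" matches John's row, while "pg" matches nothing.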

ans2) If you want to apply a condition across an array of nested records, use a UDF. Since price is declared as long in the schema, Spark passes the column to the UDF as a WrappedArray[Long] (note the import of scala.collection.mutable):

import scala.collection.mutable

sqlc.udf.register("myPriceFilter", (price: mutable.WrappedArray[Long]) => price exists (_ > 12))

var results = sqlc.sql("SELECT firstname, explode(children.price) FROM parent WHERE myPriceFilter(children['price'])")
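The predicate behind the UDF can be sanity-checked in plain Scala before registering it (a sketch; `anyPriceAbove` is a hypothetical name, and the threshold 12 is taken from q2):

```scala
// Predicate behind the UDF: true if any price in the child array
// exceeds the threshold. Spark hands array<bigint> columns to Scala
// UDFs as a WrappedArray[Long], which is a Seq[Long].
def anyPriceAbove(prices: Seq[Long], threshold: Long): Boolean =
  prices.exists(_ > threshold)
```

With this predicate in the WHERE clause, rows whose children all have price <= 12 are dropped before the explode runs, which is the pruning asked for in q2.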