I have a dataframe as follows:

+-----------+
|        f1 |
+-----------+
|[a,b,c]    |
|[e,f,g]    |
|[h,i]      |
+-----------+

I want to explode it to rows along with a repeated unique number field as follows:

+-----------+--------+
|        f1 |     uid|
+-----------+--------+
|a          |       1|
|b          |       1|
|c          |       1|
|e          |       2|
|f          |       2|
|g          |       2|
|h          |       3|
|i          |       3|
+-----------+--------+

I can perform the explode directly, as explained in "Spark: Explode a dataframe array of structs and append id", but I am not sure how to add the uid field to the new dataframe so that all elements exploded from the same array share one uid, while elements from different arrays get different uid values.

1 Answer

The right way to do this is to use monotonically_increasing_id:

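// spark-shell already provides these imports; in an application, `spark` is your SparkSession.
import org.apache.spark.sql.functions.{explode, monotonically_increasing_id}
import spark.implicits._

val df = Seq(Seq("a", "b", "c"), Seq("e", "f", "g"), Seq("h", "i")).toDF("f1")

df
  .withColumn("uid", monotonically_increasing_id()) // tag each source row before exploding
  .withColumn("f1", explode($"f1"))                 // one row per array element; uid is repeated
  .show()
// +---+---+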
// | f1|uid|
// +---+---+
// |  a|  0|
// |  b|  0|
// |  c|  0|
// |  e|  1|
// |  f|  1|
// |  g|  1|
// |  h|  2|
// |  i|  2|
// +---+---+

The numbers won't necessarily be consecutive as they are in this example, but each one will uniquely identify its source row.
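
To illustrate why the ids can have gaps: the Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The following is a minimal plain-Scala sketch of that layout, not Spark's actual code; `sketchId` is a hypothetical helper, and the assignment of rows to partitions is assumed purely for illustration.

```scala
// Sketch of the documented bit layout behind monotonically_increasing_id.
// `sketchId` is a hypothetical helper for illustration, not a Spark API.
object MonotonicIdSketch {
  // Upper 31 bits: partition ID; lower 33 bits: record number within the partition.
  def sketchId(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << 33) + recordNumber

  def main(args: Array[String]): Unit = {
    // Assumed layout: [a,b,c] and [e,f,g] land in partition 0, [h,i] in partition 1.
    val ids = Seq(sketchId(0, 0), sketchId(0, 1), sketchId(1, 0))
    println(ids.mkString(", ")) // prints 0, 1, 8589934592
  }
}
```

Under that assumed layout the third source row would get uid 8589934592 rather than 2, which is still unique and still larger than the ids before it.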

Don't use rank().over(Window.orderBy("f1")). It is inherently sequential and does not scale, and as such it should be avoided, except for local Datasets (i.e. those for which isLocal returns true).

1 Comment

I find your "don't" interesting, in that there is nothing but blurb about the under-the-hood optimization that comes with using DataFrames and Datasets. You state the opposite, which I concur with.
