
Spark is reading from Cosmos DB, which contains records like:

{
    "answers": [
        {
            "answer": "2005-01-01 00:00",
            "answerDt": "2022-07-01CEST08:07",
            ...
        },
        ...
    ],
    "id": {uuid}
}

and code that takes those answers and creates a DataFrame where each row is a new record from that array:

dataDF
    .select(
      col("id").as("recordId"),
      explode($"answers").as("qa")      // one output row per element of the answers array
    )
    .select(
      col("recordId"),
      $"qa.questionText",
      col("qa.question").as("q-id"),
      $"qa.answerText",
      $"qa.answerDt"
    )
    .withColumn("id", concat_ws("-", col("q-id"), col("recordId")))   // unique id per answer row
    .drop(col("q-id"))

At the end I save the result to another collection. What I would like is to add a position number to those records, so that each answer row also gets an integer that is unique per recordId, e.g. from 1 to 20. Desired output:

+---+--------------------+--------------------+--------------------+-------------------+--------------------+
| lp|            recordId|        questionText|          answerText|           answerDt|                  id|
+---+--------------------+--------------------+--------------------+-------------------+--------------------+
|  1|951a508c-d970-4d2...|Please give me th...|              197...|2022-06-28CEST16:52|123abcde_VB_GEN_Q...|
|  2|951a508c-d970-4d2...|What X should I N...|              female|2022-06-28CEST16:52|123abcde_VB_GEN_Q...|
|  3|951a508c-d970-4d2...|Please Share me t...|               72 kg|2022-06-28CEST16:53|123abcde_VB_GEN_Q...|
|  1|12345678-0987-4d2...|Give me the smth ...|               10 kg|2022-06-28CEST16:53|123abcde_VB_GEN_Q...|
+---+--------------------+--------------------+--------------------+-------------------+--------------------+

Is it possible? Thanks.

  • try row_number() Commented Aug 1, 2022 at 12:54
  • @mvasyliv hmm thanks, but how do I make row_number per recordId? Commented Aug 1, 2022 at 12:54
  • Which column do you want to use for orderBy? Commented Aug 1, 2022 at 13:18
  • row_number() is a window function in Spark SQL that assigns a sequential integer to each row in the result DataFrame. It is used with Window.partitionBy(), which partitions the data into window frames, and an orderBy() clause to sort the rows within each partition (a minimal standalone sketch follows this list). Commented Aug 1, 2022 at 13:38
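To make that comment concrete, here is a minimal standalone sketch with toy data (not the schema from the question) showing how row_number() restarts within each partition:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// toy DataFrame: two rows for record "A", one row for record "B"
val df = Seq(("A", "x"), ("A", "y"), ("B", "z")).toDF("recordId", "answerText")

// numbering is computed per recordId partition, ordered here by answerText
val w = Window.partitionBy("recordId").orderBy("answerText")
df.withColumn("lp", row_number().over(w)).show()
// -> (A, x, 1), (A, y, 2), (B, z, 1)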

1 Answer

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.partitionBy("recordId").orderBy("your col")
val resDF = sourceDF.withColumn("row_num", row_number().over(w))
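Applied to the DataFrame from the question, this could look as follows; note that sourceDF and the choice of answerDt as the ordering column are assumptions, and any column that defines the desired order per recordId will do:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// number the answer rows 1..n within each recordId, ordered by answerDt (assumed ordering column)
val w = Window.partitionBy("recordId").orderBy(col("answerDt"))
val resDF = sourceDF.withColumn("lp", row_number().over(w))

If the number should instead reflect the element's original position in the answers array, an alternative sketch uses posexplode, which emits the array index next to each element, so no ordering column is needed:

import org.apache.spark.sql.functions.{col, posexplode}

dataDF
  .select(
    col("id").as("recordId"),
    posexplode(col("answers")).as(Seq("lp", "qa"))   // lp = 0-based position within the answers array
  )
  .withColumn("lp", col("lp") + 1)                   // shift to 1-based numbering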
