
Spark is reading from Cosmos DB, which contains records like:

{
    "answers": [
        {
            "answer": "2005-01-01 00:00",
            "answerDt": "2022-07-01CEST08:07",
            ...
        },
        ...
    ],
    "id": {uuid}
}

and code that takes those answers and creates a DataFrame where each row is a new record from that array:

dataDF
    .select(
      col("id").as("recordId"),
      explode($"answers").as("qa")      // one output row per element of the answers array
    )
    .select(
      col("recordId"),
      $"qa.questionText",
      col("qa.question").as("q-id"),
      $"qa.answerText",
      $"qa.answerDt"
    )
    .withColumn("id", concat_ws("-", col("q-id"), col("recordId")))   // unique id per answer row
    .drop(col("q-id"))

At the end I save the result to another collection. What I would like is to add a position number to those records, so that each answer row also gets an integer that is unique per recordId, e.g. from 1 to 20. Desired output:

+---+--------------------+--------------------+--------------------+-------------------+--------------------+
| lp|            recordId|        questionText|          answerText|           answerDt|                  id|
+---+--------------------+--------------------+--------------------+-------------------+--------------------+
|  1|951a508c-d970-4d2...|Please give me th...|              197...|2022-06-28CEST16:52|123abcde_VB_GEN_Q...|
|  2|951a508c-d970-4d2...|What X should I N...|              female|2022-06-28CEST16:52|123abcde_VB_GEN_Q...|
|  3|951a508c-d970-4d2...|Please Share me t...|               72 kg|2022-06-28CEST16:53|123abcde_VB_GEN_Q...|
|  1|12345678-0987-4d2...|Give me the smth ...|               10 kg|2022-06-28CEST16:53|123abcde_VB_GEN_Q...|
+---+--------------------+--------------------+--------------------+-------------------+--------------------+

Is it possible? Thanks.

  • try row_number() Commented Aug 1, 2022 at 12:54
  • @mvasyliv hmm thanks, but how do I make row_number per recordId? Commented Aug 1, 2022 at 12:54
  • Which column do you want to use for orderBy? Commented Aug 1, 2022 at 13:18
  • row_number() is a window function in Spark SQL that assigns a sequential integer to each row in the result DataFrame. It is used with Window.partitionBy(), which partitions the data into window frames, and an orderBy() clause to sort the rows within each partition (a minimal standalone sketch follows this list). Commented Aug 1, 2022 at 13:38
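To make that comment concrete, here is a minimal standalone sketch with toy data (not the schema from the question) showing how row_number() restarts within each partition:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// toy DataFrame: two rows for record "A", one row for record "B"
val df = Seq(("A", "x"), ("A", "y"), ("B", "z")).toDF("recordId", "answerText")

// numbering is computed per recordId partition, ordered here by answerText
val w = Window.partitionBy("recordId").orderBy("answerText")
df.withColumn("lp", row_number().over(w)).show()
// -> (A, x, 1), (A, y, 2), (B, z, 1)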

1 Answer

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.partitionBy("recordId").orderBy("your col")
val resDF = sourceDF.withColumn("row_num", row_number().over(w))
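Applied to the DataFrame from the question, this could look as follows; note that sourceDF and the choice of answerDt as the ordering column are assumptions, and any column that defines the desired order per recordId will do:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// number the answer rows 1..n within each recordId, ordered by answerDt (assumed ordering column)
val w = Window.partitionBy("recordId").orderBy(col("answerDt"))
val resDF = sourceDF.withColumn("lp", row_number().over(w))

If the number should instead reflect the element's original position in the answers array, an alternative sketch uses posexplode, which emits the array index next to each element, so no ordering column is needed:

import org.apache.spark.sql.functions.{col, posexplode}

dataDF
  .select(
    col("id").as("recordId"),
    posexplode(col("answers")).as(Seq("lp", "qa"))   // lp = 0-based position within the answers array
  )
  .withColumn("lp", col("lp") + 1)                   // shift to 1-based numbering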
