0

I have following data

df.show

+---------+--------------------+------------+
|     name| age|         tokens| tokensCount|
+---------+----+---------------+------------+
|    Alice|  29|        [A,B,C]|           3|
|      Bob|  28|      [A,B,C,D]|           4|
|  Charlie|  29|    [A,B,C,D,E]|           5|    
+---------+----+---------------+------------+

I transform data with following command

val newDF = df.select(($"name") +: (0 until 4).map(i => ($"tokens")(i).alias(s"token$i")): _*).show

+---------+-------+-------+-------+-------+
|     name| token0| token1| token2| token3| 
+---------+-------+-------+-------+-------+
|    Alice|      A|      B|      C|   null|
|      Bob|      A|      B|      C|      D|
|  Charlie|      A|      B|      C|      D|
+---------+-------+-------+-------+-------+

I want to give tokensCount instead of static value 4 at (0 until 4)

I tried a few things like $"tokensCount" and size($"tokens"), but could not get through.

Can anyone suggest how to loop or map according to the size of array or count of array ?

Many thanks

1 Answer 1

2

You can modify your code to find the maximum length of tokens, and then use that to create the necessary columns:

val df = Seq(
  ("Alice", 29, Array("A", "B", "C")),
  ("Bob", 28, Array("A", "B", "C", "D")),
  ("Charlie", 29, Array("A", "B", "C", "D", "E")),
).toDF("name", "age", "tokens")

val maxTokenCount = df.withColumn("token_count", size(col("tokens"))).agg(max("token_count") as "mtc")

val newDF = df.select(($"name") +: (0 until maxTokenCount.first.getInt(0)).map(i => ($"tokens")(i).alias(s"token$i")): _*).show

Which will give you:

+-------+------+------+------+------+------+
|   name|token0|token1|token2|token3|token4|
+-------+------+------+------+------+------+
|  Alice|     A|     B|     C|  null|  null|
|    Bob|     A|     B|     C|     D|  null|
|Charlie|     A|     B|     C|     D|     E|
+-------+------+------+------+------+------+

It might be useful to explain why you want to do this transformation, as there might be a much more efficient way. This has the potential to create a very sparse dataframe. Imagine that most names have no tokens, but Bob has 100 tokens: all of a sudden you have one hundred columns of mostly null values.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks a lot. Your comment is genuine. One issue resolved that gave birth to new one :) I have mtc = 147 and there are several tuples, where i have null, which is not a desired transformation for the input where this DataFrame have to go. I will think on it and write a comment to explain a bit more tomorrow. If possible, give a remark/comment. Many thanks
Infact, i want to build a binary table for a given set of tokens(e g., A, B,C,D,E). For a specific name, if there exists a token, it should get 1, else 0. For that i have made a schema with nested StructType object.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.