
This is my Dataframe:

+----------+--------------------+--------------------+
|    NewsId|             newsArr|            transArr|
+----------+--------------------+--------------------+
|        26|[Republicans, Sto...|[[R, IH0, P, AH1,...|
|        29|[ISIS, Claims, Re...|[[AY1, S, AH0], [...|
|       474|[Concert, for, Tr...|[[K, AA1, N, S, E...|
|       964|[How, a, Fractiou...|[[HH, AW1], [AH0]...|
|      1677|[Review:, ‘Kong:,...|[[n/a], [n/a], [S...|
|      1697|[The, Rice-Size, ...|[[DH, AH0], [n/a]...|
|      1806|[Populists, Appea...|[[P, AA1, Y, AH0,...|
|      1950|[Uber, Board, Sta...|[[Y, UW1, B, ER0]...|
|      2040|[Health, Bill’s, ...|[[HH, EH1, L, TH]...|
|      2214|[Unmasking, the, ...|[[n/a], [DH, AH0]...|
+----------+--------------------+--------------------+

I want to make the "transArr" column cells into strings like this:

+----------+--------------------+--------------+
|    NewsId|             newsArr|      transArr|
+----------+--------------------+--------------+
|        26|[Republicans, Sto...|R IH0 P AH1...|
|        29|[ISIS, Claims, Re...|AY1 S AH0...  |
|       474|[Concert, for, Tr...|K AA1 N S E...|
|       964|[How, a, Fractiou...|HH AW1 AH0... |
|      1677|[Review:, ‘Kong:,...|n/a n/a S...  |
|      1697|[The, Rice-Size, ...|DH AH0 n/a... |
|      1806|[Populists, Appea...|P AA1 Y AH0...|
|      1950|[Uber, Board, Sta...|Y UW1 B ER0...|
|      2040|[Health, Bill’s, ...|HH EH1 L TH...|
|      2214|[Unmasking, the, ...|n/a DH AH0... |
+----------+--------------------+--------------+

Is there a relatively easy solution to this?

2 Answers


Use concat_ws and flatten; check the code below.

scala> df.printSchema
root
 |-- data: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)

scala> df
  .withColumn("flatten", concat_ws(" ", flatten($"data")))
  .show(false)

+------------+-------+
|data        |flatten|
+------------+-------+
|[[abc, cdf]]|abc cdf|
+------------+-------+
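Applied to the DataFrame in the question, the same pattern would look roughly like the sketch below (column names are taken from the output shown in the question, and `df` is assumed to hold that DataFrame):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{concat_ws, flatten}

// transArr is an array of arrays of strings: flatten it into a single
// array, then join all phonemes with a space to get one string per row.
val result = df.withColumn("transArr", concat_ws(" ", flatten($"transArr")))
result.show()
```

Reusing the column name in withColumn replaces transArr in place, matching the desired output table above.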

2 Comments

I got the same result without flatten; try using only concat_ws and wrapping the array column in col.
Sure, I have posted my sample.

Using concat_ws:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws}
import spark.implicits._
val df: DataFrame = Seq(
  ("a1", Array("2", "3", "5")),
  ("b2", Array("1", "6", "23")),
  ("b1", Array("df", "l2", "14")),
  ("c1", Array("te", "3pa", "gw"))
).toDF("key", "values")
df.show()
val newDF = df.withColumn("values", concat_ws(" ", col("values")))
newDF.show()
newDF.printSchema()

Output:

+---+-------------+
|key|       values|
+---+-------------+
| a1|    [2, 3, 5]|
| b2|   [1, 6, 23]|
| b1| [df, l2, 14]|
| c1|[te, 3pa, gw]|
+---+-------------+

+---+---------+
|key|   values|
+---+---------+
| a1|    2 3 5|
| b2|   1 6 23|
| b1| df l2 14|
| c1|te 3pa gw|
+---+---------+

root
 |-- key: string (nullable = true)
 |-- values: string (nullable = false)

2 Comments

Your values column is of type Array[String], not Array[Array[String]].
@Srinivas Yes, you are right. I missed one pair of brackets; in that case, using flatten is required.
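The distinction raised in the comments can be sketched with plain Scala collections (no Spark needed; the values here are made up for illustration). flatten is only required when the column holds nested arrays, as the question's transArr does:

```scala
// A flat array, as in this answer's sample: one join is enough.
// This mirrors concat_ws(" ", col("values")).
val flat = Seq("2", "3", "5")
val joinedFlat = flat.mkString(" ")

// A nested array, as in the question's transArr column:
// flatten first, then join, mirroring concat_ws(" ", flatten(col)).
val nested = Seq(Seq("R", "IH0", "P"), Seq("AH1", "N"))
val joinedNested = nested.flatten.mkString(" ")

println(joinedFlat)   // 2 3 5
println(joinedNested) // R IH0 P AH1 N
```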
