
I have a Spark DataFrame (Scala) with an id (Int) column and a tokens (array<string>) column:

id,tokens
0,["a","b","c"]
1,["a","b"]
...

Assuming I am able to retrieve the data via a SparkSession and cast it to a case class (a minimal loading sketch follows the expected output below):

case class Token(id: Int, tokens: Array[String])

After getting a Dataset[Token], how do I concatenate all the token arrays into a single Array[String] and then count occurrences to find the most frequently occurring strings?

Expected output:

a,2
b,2
c,1
...
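For reference, a minimal sketch of how such a Dataset[Token] might be obtained. The JSON source, path, and explicit schema are illustrative assumptions; the explicit schema matters because JSON inference would type id as Long, which .as[Token] cannot narrow to the case class's Int:

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types._

case class Token(id: Int, tokens: Array[String])

val spark = SparkSession.builder().appName("tokens").getOrCreate()
import spark.implicits._

// Force id to Int up front: inference would yield Long, and .as[Token] cannot downcast.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("tokens", ArrayType(StringType))
))

// Each input line looks like: {"id": 0, "tokens": ["a","b","c"]}
val ds: Dataset[Token] = spark.read.schema(schema).json("/path/to/tokens.json").as[Token]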

1 Answer


You need to explode the token column and take the count after grouping by the individual tokens:

scala> val input = sc.parallelize(List(
  (0, Array("a","b","c")),
  (1, Array("a","b"))
)).toDF("id","token")

scala> input.withColumn("token_split",explode($"token"))  // one row per (id, token) pair
         .groupBy($"token_split")                         // group identical tokens
         .agg(count($"id") as "count")                    // occurrences per token
         .orderBy($"count".desc)                          // most frequent first
         .show

Output:

+-----------+-----+
|token_split|count|
+-----------+-----+
|          b|    2|
|          a|    2|
|          c|    1|
+-----------+-----+
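
Since the question starts from a typed Dataset[Token], note that the same DataFrame operations apply to it directly; for a fully typed flattening you can also use flatMap in place of explode. A minimal sketch, assuming spark.implicits._ is in scope as it is in spark-shell:

scala> case class Token(id: Int, tokens: Array[String])

scala> val ds = Seq(
  Token(0, Array("a","b","c")),
  Token(1, Array("a","b"))
).toDS

scala> ds.flatMap(_.tokens)       // Dataset[String]; the column is named "value" by default
         .groupBy($"value")
         .count()
         .orderBy($"count".desc)
         .show

This yields the same counts, with the tokens under the value column.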

2 Comments

Thank you for the answer! To add on, we can use .count() instead of .agg(count($"id") as "count") :)
@Ivan, yes that's true, even .count() works. I explicitly used .agg() in order to avoid confusion like this: stackoverflow.com/questions/52966347/…
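
For reference, the shorthand mentioned in the comment, on the same input DataFrame as the answer; .count() on a grouped DataFrame emits a column named count, counting rows per group (equivalent to count($"id") here since id is never null):

scala> input.withColumn("token_split",explode($"token"))
         .groupBy($"token_split")
         .count()
         .orderBy($"count".desc)
         .show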
