I have a Spark DataFrame (Scala) with an id column (Int) and a tokens column (array<string>):
id,tokens
0,["a","b","c"]
1,["a","b"]
...
Assume I can load the data via a SparkSession and cast it to a case class:
case class Token(id: Int, tokens: Array[String])
After getting a Dataset[Token], how do I concatenate all the tokens arrays into a single collection of strings and then count the occurrences of each string to find the most frequent ones?
Expected output:
a,2
b,2
c,1
...
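
For reference, here is a minimal sketch of one way this could be approached, not necessarily the best one. It assumes a local SparkSession and rebuilds the two sample rows by hand. The object name TokenCounts and the helper tokenCounts are hypothetical names used only for illustration. The idea is to flatten the arrays with flatMap, which gives a Dataset[String] whose single column is named "value", and then group and count on that column.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, desc}

// Same case class as in the question.
case class Token(id: Int, tokens: Array[String])

object TokenCounts {
  // Hypothetical helper: flattens every tokens array and counts each string.
  def tokenCounts(spark: SparkSession, ds: Dataset[Token]): Array[(String, Long)] = {
    import spark.implicits._
    ds.flatMap(_.tokens.toSeq)               // Dataset[String], column "value"
      .groupBy("value")                      // one group per distinct token
      .count()                               // adds a Long column named "count"
      .orderBy(desc("count"), col("value"))  // most frequent first, ties by token
      .as[(String, Long)]
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for trying the sketch out.
    val spark = SparkSession.builder()
      .appName("token-counts")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val ds = Seq(
      Token(0, Array("a", "b", "c")),
      Token(1, Array("a", "b"))
    ).toDS()

    tokenCounts(spark, ds).foreach { case (token, n) => println(s"$token,$n") }
    spark.stop()
  }
}
```

On the two sample rows this should print a,2 then b,2 then c,1. The collect() call is only reasonable when the number of distinct tokens is small. For a large vocabulary it is probably better to keep the result as a Dataset and use show() or limit() instead. Selecting explode(col("tokens")) would be an alternative to flatMap that stays in the untyped DataFrame API.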