
I have a Spark DataFrame (Scala) with an id (Int) column and a tokens (array<string>) column:

id,tokens
0,["a","b","c"]
1,["a","b"]
...

Assuming I am able to retrieve the data via a SparkSession and cast it to a case class (a minimal loading sketch follows the expected output below):

case class Token(id: Int, tokens: Array[String])

After getting a Dataset[Token], how do I concatenate all the token arrays into a single Array[String] and then count occurrences to find the most frequently occurring strings?

Expected output:

a,2
b,2
c,1
...
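For reference, a minimal sketch of how such a Dataset[Token] might be obtained. The JSON source, path, and explicit schema are illustrative assumptions; the explicit schema matters because JSON inference would type id as Long, which .as[Token] cannot narrow to the case class's Int:

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types._

case class Token(id: Int, tokens: Array[String])

val spark = SparkSession.builder().appName("tokens").getOrCreate()
import spark.implicits._

// Force id to Int up front: inference would yield Long, and .as[Token] cannot downcast.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("tokens", ArrayType(StringType))
))

// Each input line looks like: {"id": 0, "tokens": ["a","b","c"]}
val ds: Dataset[Token] = spark.read.schema(schema).json("/path/to/tokens.json").as[Token]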

1 Answer


You need to explode the token column and take the count after grouping by the individual tokens:

scala> val input = sc.parallelize(List(
  (0, Array("a","b","c")),
  (1, Array("a","b"))
)).toDF("id","token")

scala> input.withColumn("token_split",explode($"token"))  // one row per (id, token) pair
         .groupBy($"token_split")                         // group identical tokens
         .agg(count($"id") as "count")                    // occurrences per token
         .orderBy($"count".desc)                          // most frequent first
         .show

Output:

+-----------+-----+
|token_split|count|
+-----------+-----+
|          b|    2|
|          a|    2|
|          c|    1|
+-----------+-----+
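
Since the question starts from a typed Dataset[Token], note that the same DataFrame operations apply to it directly; for a fully typed flattening you can also use flatMap in place of explode. A minimal sketch, assuming spark.implicits._ is in scope as it is in spark-shell:

scala> case class Token(id: Int, tokens: Array[String])

scala> val ds = Seq(
  Token(0, Array("a","b","c")),
  Token(1, Array("a","b"))
).toDS

scala> ds.flatMap(_.tokens)       // Dataset[String]; the column is named "value" by default
         .groupBy($"value")
         .count()
         .orderBy($"count".desc)
         .show

This yields the same counts, with the tokens under the value column.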

2 Comments

Thank you for the answer! To add on, we can use .count() instead of .agg(count($"id") as "count") :)
@Ivan, yes that's true, even .count() works. I explicitly used .agg() in order to avoid confusion like this: stackoverflow.com/questions/52966347/…
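
For reference, the shorthand mentioned in the comment, on the same input DataFrame as the answer; .count() on a grouped DataFrame emits a column named count, counting rows per group (equivalent to count($"id") here since id is never null):

scala> input.withColumn("token_split",explode($"token"))
         .groupBy($"token_split")
         .count()
         .orderBy($"count".desc)
         .show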
