
I am new to Spark and Scala. I am using an NLP package to analyze a book. The NLP method returns a spark.sql.DataFrame = [result: string] that looks like:

|     Result|
| a, b, c, d|
|      e,f,g|

In order to count how many times each word appears, I want to split each string into multiple rows like:

|Result|
|     a|
|     b|
|     c|
|   ...|

Or do you have any suggestions about how to do words count based on a DataFrame with structure [result: string]?

1 Answer

The Spark SQL function `explode` will help in this use case.

Refer to the sample code below; I also added the word-count step:

import org.apache.spark.sql.functions._
import spark.implicits._

val rows = List(
  "a, b, c, d",
  "a,b,c,d",
  "e,f,g"
)
val df = spark.sparkContext.parallelize(rows).toDF("Result")
df.show()
//  +----------+
//  |    Result|
//  +----------+
//  |a, b, c, d|
//  |   a,b,c,d|
//  |     e,f,g|
//  +----------+

val allWords = df.select(explode(split(col("Result"), ",")).as("Result"))
allWords.show()
//  +------+
//  |Result|
//  +------+
//  |     a|
//  |     b|
//  |     c|
//  |     d|
//  |     a|
//  |     b|
//  |     c|
//  |     d|
//  |     e|
//  |     f|
//  |     g|
//  +------+

// trim removes the spaces left over from splitting on "," alone;
// use count (not countDistinct) so every occurrence of a word is tallied
val countPerWord = allWords
  .groupBy(trim(col("Result")).alias("Result"))
  .agg(count("Result").alias("Count"))
countPerWord.show()
//  +------+-----+
//  |Result|Count|
//  +------+-----+
//  |     g|    1|
//  |     f|    1|
//  |     e|    1|
//  |     d|    2|
//  |     c|    2|
//  |     b|    2|
//  |     a|    2|
//  +------+-----+
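If you want to sanity-check the expected counts without a Spark session, the same split → explode → trim → count pipeline can be sketched on a plain Scala collection (this is just an illustration of the logic, not a replacement for the distributed version):

```scala
object WordCountSketch extends App {
  val rows = List("a, b, c, d", "a,b,c,d", "e,f,g")

  val counts: Map[String, Int] = rows
    .flatMap(_.split(","))  // "explode": one element per comma-separated token
    .map(_.trim)            // drop the stray spaces around tokens
    .groupBy(identity)      // group equal words together
    .map { case (word, occurrences) => word -> occurrences.size }

  // counts == Map with a,b,c,d -> 2 and e,f,g -> 1
  println(counts)
}
```

The `flatMap` plays the role of `explode`, and `groupBy` plus `size` plays the role of `groupBy(...).agg(count(...))`.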