
I am new to Spark and Scala. I am using an NLP package to analyze a book. The NLP method returns a spark.sql.DataFrame = [result: string] that looks like:

|     Result|
| a, b, c, d|
|      e,f,g|

In order to count how many times each word appears, I want to split each string into multiple rows like:

|Result|
|     a|
|     b|
|     c|
|   ...|

Or do you have any suggestions about how to do words count based on a DataFrame with structure [result: string]?

1 Answer

The Spark SQL function `explode` will help in this use case.

Refer to the sample code below; I also added the word-count step:

import org.apache.spark.sql.functions._
import spark.implicits._

val rows = List(
  "a, b, c, d",
  "a,b,c,d",
  "e,f,g"
)
val df = spark.sparkContext.parallelize(rows).toDF("Result")
df.show()
//  +----------+
//  |    Result|
//  +----------+
//  |a, b, c, d|
//  |   a,b,c,d|
//  |     e,f,g|
//  +----------+

val allWords = df.select(explode(split(col("Result"), ",")).as("Result"))
allWords.show()
//  +------+
//  |Result|
//  +------+
//  |     a|
//  |     b|
//  |     c|
//  |     d|
//  |     a|
//  |     b|
//  |     c|
//  |     d|
//  |     e|
//  |     f|
//  |     g|
//  +------+

// trim removes the spaces left over from splitting on "," alone;
// use count (not countDistinct) so every occurrence of a word is tallied
val countPerWord = allWords
  .groupBy(trim(col("Result")).alias("Result"))
  .agg(count("Result").alias("Count"))
countPerWord.show()
//  +------+-----+
//  |Result|Count|
//  +------+-----+
//  |     g|    1|
//  |     f|    1|
//  |     e|    1|
//  |     d|    2|
//  |     c|    2|
//  |     b|    2|
//  |     a|    2|
//  +------+-----+
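If you want to sanity-check the expected counts without a Spark session, the same split → explode → trim → count pipeline can be sketched on a plain Scala collection (this is just an illustration of the logic, not a replacement for the distributed version):

```scala
object WordCountSketch extends App {
  val rows = List("a, b, c, d", "a,b,c,d", "e,f,g")

  val counts: Map[String, Int] = rows
    .flatMap(_.split(","))  // "explode": one element per comma-separated token
    .map(_.trim)            // drop the stray spaces around tokens
    .groupBy(identity)      // group equal words together
    .map { case (word, occurrences) => word -> occurrences.size }

  // counts == Map with a,b,c,d -> 2 and e,f,g -> 1
  println(counts)
}
```

The `flatMap` plays the role of `explode`, and `groupBy` plus `size` plays the role of `groupBy(...).agg(count(...))`.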