
I have a DataFrame with two columns, created by importing a .txt file.

Sample file content:

Sankar Biswas, Played{"94"}
Puja "Kumari" Jha, Didnot
Man Women, null
null,Gay Gentleman
null,null

I created a DataFrame by importing the above file:

val a = sc.textFile("file:////Users/sankar.biswas/Desktop/hello.txt")

case class Table(contentName: String, VersionDetails: String)

val b = a.map(_.split(",")).map(p => Table(p(0).trim,p(1).trim)).toDF

Now I have a function defined, let's say like this:

  def getFormattedName(contentName : String, VersionDetails:String): Option[String] = {
    Option(contentName+VersionDetails)
  }

What I need to do is take each row of the DataFrame and call getFormattedName, passing the row's two column values as arguments.

I tried this, among several other attempts, but none of them worked:

val a = b.map((m,n) => getFormattedName(m,n))

Looking forward to any suggestion you have for me. Thanks in advance.

  • If you plan to use higher-order functions such as map and filter, I would suggest using a Dataset instead. I would also use the DataFrameReader API to read your CSV, as Constantine already suggested. BTW, you can derive a schema from a case class, and you can cast a DataFrame to a Dataset[T] where T is a case class, so that you can pattern match against your case class like a tuple. You can also use tuples if you prefer. Commented Jan 4, 2019 at 14:49
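A minimal sketch of the Dataset approach the comment describes, assuming Spark 2.x or later on the classpath. The object name DatasetSketch, the local master setting, and the in-memory sample rows are illustrative assumptions, not part of the original post; the Option is unwrapped with getOrElse so the result is a plain Dataset[String].

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Must be defined at top level so Spark can derive an encoder for it.
case class Table(contentName: String, VersionDetails: String)

object DatasetSketch {
  def getFormattedName(contentName: String, versionDetails: String): Option[String] =
    Option(contentName + versionDetails)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-sketch")
      .getOrCreate()
    import spark.implicits._

    // A schema can be derived from the case class, e.g. to pass to spark.read.schema(...).
    val schema = Encoders.product[Table].schema
    println(schema.treeString)

    // In-memory stand-in for the parsed file; rows taken from the sample above.
    val ds = Seq(
      Table("Man Women", "null"),
      Table("null", "Gay Gentleman")
    ).toDS()

    // Pattern match on the case class instead of indexing into a Row.
    val formatted = ds.map { case Table(name, version) =>
      getFormattedName(name, version).getOrElse("")
    }

    formatted.show(false)
    spark.stop()
  }
}
```

An existing DataFrame whose column names match the case class fields could be converted the same way with b.as[Table].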

2 Answers

I think you have a structured schema that can be represented by a DataFrame. The DataFrame API supports reading CSV input directly.

import org.apache.spark.sql.types._
val customSchema = StructType(Array(StructField("contentName", StringType, true),StructField("titleVersionDesc", StringType, true)))

val df = spark.read.schema(customSchema).csv("input.csv")

To call a custom method on a Dataset, you can create a UDF (user-defined function).

def getFormattedName(contentName : String, titleVersionDesc:String): Option[String] = {
    Option(contentName+titleVersionDesc)
  }

import org.apache.spark.sql.functions.udf
val get_formatted_name = udf(getFormattedName _)

df.select(get_formatted_name($"contentName", $"titleVersionDesc"))

Try

val a = b.map(row => getFormattedName(row(0),row(1)))

Remember that the rows of a DataFrame are of type Row, not tuples, so you need to access their elements through the Row API.

1 Comment

The apply method in Row returns Any, so you would have to cast the values before passing them to the method. I would use row.getAs[String](0) instead.
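A hedged sketch of what the getAs suggestion could look like end to end, assuming Spark 2.x or later. The object name RowGetAsSketch, the local master setting, and the in-memory rows are illustrative assumptions; the Option is unwrapped so that the built-in String encoder applies to the result.

```scala
import org.apache.spark.sql.SparkSession

object RowGetAsSketch {
  def getFormattedName(contentName: String, versionDetails: String): Option[String] =
    Option(contentName + versionDetails)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-getAs-sketch")
      .getOrCreate()
    import spark.implicits._ // provides toDF and the encoder needed by map

    // In-memory stand-in for the DataFrame b built in the question.
    val b = Seq(
      ("Man Women", "null"),
      ("null", "Gay Gentleman")
    ).toDF("contentName", "VersionDetails")

    // row(0) would be typed Any; getAs[String] returns a typed value instead.
    val formatted = b.map { row =>
      getFormattedName(row.getAs[String](0), row.getAs[String](1)).getOrElse("")
    }

    formatted.show(false)
    spark.stop()
  }
}
```

Row also accepts a field name, as in row.getAs[String]("contentName"), which avoids depending on column order.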
