
I want to implement the following: I have two Emp files, and I want to select only two columns, for example Empid and EmpName. If a file doesn't have an EmpName column, the resulting DataFrame should contain only the Empid column.

1) Emp1.csv (File)

Empid   EmpName Dept
1       ABC     IS
2       XYZ     COE

2) Emp.csv (File)

 Empid  EmpName
 1      ABC
 2      XYZ

Code I have tried so far:

scala>  val SourceData = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("delimiter", ",").option("header", "true").load("/root/Empfiles/")
SourceData: org.apache.spark.sql.DataFrame = [Empid: string, EmpName: string ... 1 more field]

scala> SourceData.printSchema
root
|-- Empid: string (nullable = true)
|-- EmpName: string (nullable = true)
|-- Dept: string (nullable = true)

This code works if I specify all of the file's column names:

scala> var FormatedColumn = SourceData.select(
     |   SourceData.columns.map {
     |     case "Empid"   => SourceData("Empid").cast(IntegerType).as("empid")
     |     case "EmpName" => SourceData("EmpName").cast(StringType).as("empname")
     |     case "Dept"    => SourceData("Dept").cast(StringType).as("dept")
     |   }: _*
     | )
FormatedColumn: org.apache.spark.sql.DataFrame = [empid: int, empname: string ... 1 more field]

But when I want only those two specific columns, it fails. (If a column is available, it should be selected with its datatype and column name changed.)

scala> var FormatedColumn = SourceData.select(
     |   SourceData.columns.map {
     |     case "Empid"   => SourceData("Empid").cast(IntegerType).as("empid")
     |     case "EmpName" => SourceData("EmpName").cast(StringType).as("empname")
     |   }: _*
     | )
 scala.MatchError: Dept (of class java.lang.String)
 at $anonfun$1.apply(<console>:32)
 at $anonfun$1.apply(<console>:32)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  ... 53 elided
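The failure is not Spark-specific: it is Scala's `MatchError`, thrown whenever a match expression has no case for the value at hand. The following plain-Scala sketch (no Spark needed; the column names are just strings here) reproduces the same failure mode as the transcript above:

```scala
import scala.util.Try

// The full set of column names, as Spark would report them for Emp1.csv.
val columns = Array("Empid", "EmpName", "Dept")

// Mapping a non-exhaustive match over ALL names throws a MatchError
// as soon as it reaches a name with no matching case ("Dept" here).
val result = Try {
  columns.map {
    case "Empid"   => "empid"
    case "EmpName" => "empname"
    // no case for "Dept" -> scala.MatchError at runtime
  }
}

println(result) // a Failure wrapping scala.MatchError: Dept
```

Because the map runs over every element of `SourceData.columns`, any column you do not handle must either be matched by a default case or filtered out before the map.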
  • It throws a MatchError because it can't find a matching case in your map function. If you add a default case it should succeed. You could also select "Empid" and "EmpName" first and format the columns afterwards. Commented Aug 10, 2017 at 7:27
  • I am sorry, I am new to Scala. In the default case I don't want to do anything, so what should I write? Commented Aug 10, 2017 at 7:32

3 Answers


All other columns need to be matched too:

var formattedColumn = sourceData.select(
  sourceData.columns.map {
      case "Empid"   => sourceData("Empid").cast(IntegerType).as("empid")
      case "EmpName" => sourceData("EmpName").cast(StringType).as("empname")
      case other: String => sourceData(other)
  }: _*
)

Update 1. If you want to select only the two columns "Empid" and "EmpName", there is no need to use the matcher:

val formattedColumn = sourceData.select(
  sourceData("Empid").cast(IntegerType).as("empid"),
  sourceData("EmpName").cast(StringType).as("empname")
)

Update 2. If you want to select the columns depending on their existence, I can suggest the following:

val colEmpId = "Empid"
val colEmpName = "EmpName"
// list of possible expected column names
val selectableColums = Seq(colEmpId, colEmpName)
// take only the ones that are in the list
val foundColumns = sourceData.columns.filter(column => selectableColums.contains(column))
// create the target dataframe
val formattedColumn = sourceData.select(
  foundColumns.map {
    // backticks make these patterns compare against the vals above;
    // without them, `case colEmpId` would bind a fresh variable and
    // match every string, so the other cases could never be reached
    case `colEmpId`   => sourceData(colEmpId).cast(IntegerType).as("empid")
    case `colEmpName` => sourceData(colEmpName).cast(StringType).as("empname")
    case other        => throw new IllegalArgumentException("Unexpected column: " + other)
  }: _*
)
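The filter-then-match logic can be checked without Spark at all, since it only operates on the array of column names. A minimal sketch, simulating a file that lacks the EmpName column:

```scala
val colEmpId   = "Empid"
val colEmpName = "EmpName"
// list of possible expected column names
val selectableColumns = Seq(colEmpId, colEmpName)

// simulate a file that has no EmpName column
val fileColumns = Array("Empid", "Dept")

// keep only the expected columns that the file actually has
val foundColumns = fileColumns.filter(selectableColumns.contains)

// backticked patterns compare against the vals instead of binding;
// the target names stand in for the cast-and-rename expressions
val renamed = foundColumns.map {
  case `colEmpId`   => "empid"
  case `colEmpName` => "empname"
  case other        => throw new IllegalArgumentException("Unexpected column: " + other)
}

println(renamed.toList) // List(empid) — only the column the file really has
```

This is exactly the behaviour the question asks for: a file without EmpName yields a one-column result instead of a MatchError.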

p.s. please use conventional camelCase names for vals and vars.


2 Comments

scala> FormattedColumn.printSchema root |-- empid: integer (nullable = true) |-- empname: string (nullable = true) |-- Dept: string (nullable = true) — I don't want to select Dept in the formattedColumn DataFrame; I only want the 2 columns.
No, it doesn't work: if a file doesn't have EmpName it should select only the Empid column, and it fails in that case.

If you replace your statement with this query it should work. It filters out all columns that aren't part of your match clause, which avoids the MatchError that you see.

df.select($"Empid", $"EmpName").select(df.columns.map {
    case "Empid" => df("Empid").cast(IntegerType).as("empid")
    case "EmpName" => df("EmpName").cast(StringType).as("empname")
}: _*)
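Note that this hard-codes both column names in the first `select`, so it still fails when a file lacks one of them. A small plain-Scala sketch (no Spark; plain name lists stand in for the DataFrame schema) of why, and of the intersect-based alternative:

```scala
// the columns we would like, and the columns one file actually has
val wanted      = Seq("Empid", "EmpName")
val fileColumns = Seq("Empid") // an Emp file without EmpName

// selecting any of these by name would raise an analysis error in Spark
val missing = wanted.diff(fileColumns)

// intersecting first keeps only names that are safe to select
val present = wanted.intersect(fileColumns)

println(missing) // List(EmpName)
println(present) // List(Empid)
```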

4 Comments

No, it doesn't work; I have tried your code: <console>:32: error: overloaded method value select with alternatives: cannot be applied to (Array[org.apache.spark.sql.Column]) df.select($"Empid", $"EmpName").select(df.columns.map {
I guess I was missing ": _*".
If the file doesn't have EmpName it should select only the Empid column; I don't think this satisfies that.
Yes, you're right. It only avoids the MatchError that you see.

I'm not sure why this needs to be so complicated.

Why not do this?

df
  .withColumn("empid", $"EmpId".cast(IntegerType))
  .withColumn("empname", $"EmpName".cast(StringType))

Comments
