
I want to implement the following: I have two Emp files, and I want to select only two columns, for example Empid and EmpName. If a file doesn't have an EmpName column, the resulting DataFrame should contain only the Empid column.

1) Emp1.csv (File)

Empid   EmpName Dept
1       ABC     IS
2       XYZ     COE

2) Emp.csv (File)

 Empid  EmpName
 1      ABC
 2      XYZ

Code I have tried so far:

scala>  val SourceData = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("delimiter", ",").option("header", "true").load("/root/Empfiles/")
SourceData: org.apache.spark.sql.DataFrame = [Empid: string, EmpName: string ... 1 more field]

scala> SourceData.printSchema
root
|-- Empid: string (nullable = true)
|-- EmpName: string (nullable = true)
|-- Dept: string (nullable = true)

This code works if I specify all of the file's column names:

scala> var FormatedColumn = SourceData.select(
     |   SourceData.columns.map {
     |     case "Empid"   => SourceData("Empid").cast(IntegerType).as("empid")
     |     case "EmpName" => SourceData("EmpName").cast(StringType).as("empname")
     |     case "Dept"    => SourceData("Dept").cast(StringType).as("dept")
     |   }: _*
     | )
FormatedColumn: org.apache.spark.sql.DataFrame = [empid: int, empname: string ... 1 more field]

But when I want only those two specific columns, it fails. (If a column is available, it should be selected with its datatype and column name changed.)

scala> var FormatedColumn = SourceData.select(
     |   SourceData.columns.map {
     |     case "Empid"   => SourceData("Empid").cast(IntegerType).as("empid")
     |     case "EmpName" => SourceData("EmpName").cast(StringType).as("empname")
     |   }: _*
     | )
 scala.MatchError: Dept (of class java.lang.String)
 at $anonfun$1.apply(<console>:32)
 at $anonfun$1.apply(<console>:32)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  ... 53 elided
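The failure is not Spark-specific: it is Scala's `MatchError`, thrown whenever a match expression has no case for the value at hand. The following plain-Scala sketch (no Spark needed; the column names are just strings here) reproduces the same failure mode as the transcript above:

```scala
import scala.util.Try

// The full set of column names, as Spark would report them for Emp1.csv.
val columns = Array("Empid", "EmpName", "Dept")

// Mapping a non-exhaustive match over ALL names throws a MatchError
// as soon as it reaches a name with no matching case ("Dept" here).
val result = Try {
  columns.map {
    case "Empid"   => "empid"
    case "EmpName" => "empname"
    // no case for "Dept" -> scala.MatchError at runtime
  }
}

println(result) // a Failure wrapping scala.MatchError: Dept
```

Because the map runs over every element of `SourceData.columns`, any column you do not handle must either be matched by a default case or filtered out before the map.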
  • It throws a MatchError because it can't find a matching case in your map function. If you add a default case it should succeed. You could also select "Empid" and "EmpName" first and format the columns afterwards. Commented Aug 10, 2017 at 7:27
  • I am sorry, I am new to Scala. In the default case I don't want to do anything, so what should I write? Commented Aug 10, 2017 at 7:32

3 Answers


All other columns need to be matched too:

var formattedColumn = sourceData.select(
  sourceData.columns.map {
      case "Empid"   => sourceData("Empid").cast(IntegerType).as("empid")
      case "EmpName" => sourceData("EmpName").cast(StringType).as("empname")
      case other: String => sourceData(other)
  }: _*
)

Update 1. If you want to select only the two columns "Empid" and "EmpName", there is no need to use the matcher:

val formattedColumn = sourceData.select(
  sourceData("Empid").cast(IntegerType).as("empid"),
  sourceData("EmpName").cast(StringType).as("empname")
)

Update 2. If you want to select the columns depending on their existence, I can suggest the following:

val colEmpId = "Empid"
val colEmpName = "EmpName"
// list of possible expected column names
val selectableColums = Seq(colEmpId, colEmpName)
// take only the ones that are in the list
val foundColumns = sourceData.columns.filter(column => selectableColums.contains(column))
// create the target dataframe
val formattedColumn = sourceData.select(
  foundColumns.map {
    // backticks make these patterns compare against the vals above;
    // without them, `case colEmpId` would bind a fresh variable and
    // match every string, so the other cases could never be reached
    case `colEmpId`   => sourceData(colEmpId).cast(IntegerType).as("empid")
    case `colEmpName` => sourceData(colEmpName).cast(StringType).as("empname")
    case other        => throw new IllegalArgumentException("Unexpected column: " + other)
  }: _*
)
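The filter-then-match logic can be checked without Spark at all, since it only operates on the array of column names. A minimal sketch, simulating a file that lacks the EmpName column:

```scala
val colEmpId   = "Empid"
val colEmpName = "EmpName"
// list of possible expected column names
val selectableColumns = Seq(colEmpId, colEmpName)

// simulate a file that has no EmpName column
val fileColumns = Array("Empid", "Dept")

// keep only the expected columns that the file actually has
val foundColumns = fileColumns.filter(selectableColumns.contains)

// backticked patterns compare against the vals instead of binding;
// the target names stand in for the cast-and-rename expressions
val renamed = foundColumns.map {
  case `colEmpId`   => "empid"
  case `colEmpName` => "empname"
  case other        => throw new IllegalArgumentException("Unexpected column: " + other)
}

println(renamed.toList) // List(empid) — only the column the file really has
```

This is exactly the behaviour the question asks for: a file without EmpName yields a one-column result instead of a MatchError.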

p.s. please use conventional camelCase names for vals and vars.


2 Comments

scala> FormattedColumn.printSchema root |-- empid: integer (nullable = true) |-- empname: string (nullable = true) |-- Dept: string (nullable = true) — I don't want to select Dept in the formattedColumn DataFrame; I only want the 2 columns.
No, it doesn't work: if a file doesn't have EmpName it should select only the Empid column, and it fails in that case.

If you replace your statement with this query it should work. It filters out all columns that aren't part of your match clause, which avoids the MatchError that you see.

df.select($"Empid", $"EmpName").select(df.columns.map {
    case "Empid" => df("Empid").cast(IntegerType).as("empid")
    case "EmpName" => df("EmpName").cast(StringType).as("empname")
}: _*)
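Note that this hard-codes both column names in the first `select`, so it still fails when a file lacks one of them. A small plain-Scala sketch (no Spark; plain name lists stand in for the DataFrame schema) of why, and of the intersect-based alternative:

```scala
// the columns we would like, and the columns one file actually has
val wanted      = Seq("Empid", "EmpName")
val fileColumns = Seq("Empid") // an Emp file without EmpName

// selecting any of these by name would raise an analysis error in Spark
val missing = wanted.diff(fileColumns)

// intersecting first keeps only names that are safe to select
val present = wanted.intersect(fileColumns)

println(missing) // List(EmpName)
println(present) // List(Empid)
```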

4 Comments

No, it doesn't work; I have tried your code: <console>:32: error: overloaded method value select with alternatives: cannot be applied to (Array[org.apache.spark.sql.Column]) df.select($"Empid", $"EmpName").select(df.columns.map {
I guess I was missing ": _*".
If the file doesn't have EmpName it should select only the Empid column; I don't think this satisfies that.
Yes, you're right. It only avoids the MatchError that you see.

I'm not sure why this needs to be so complicated.

Why not do this?

df
  .withColumn("empid", $"EmpId".cast(IntegerType))
  .withColumn("empname", $"EmpName".cast(StringType))

Comments
