I would like to write a yeardiff function that works like datediff: it should take two Column arguments and return a Column containing the number of years between them.
Let's use the following sample data:
val testDf = Seq(
  ("2016-09-10", "2001-08-10"),
  ("2016-04-18", "2010-05-18"),
  ("2016-01-10", "2013-08-10")
)
  .toDF("first_datetime", "second_datetime")
  .withColumn("first_datetime", $"first_datetime".cast("timestamp"))
  .withColumn("second_datetime", $"second_datetime".cast("timestamp"))
We can run this to get the date difference:
testDf.withColumn("num_days", datediff(col("first_datetime"), col("second_datetime")))
I want to be able to run this:
testDf.withColumn("num_years", yeardiff(col("first_datetime"), col("second_datetime")))
I tried to define a yeardiff function with the required signature, but I don't know how to implement the body:
def yeardiff(end: Column, start: Column): Column = {
  // what do I do here
}
Here is a hacky DataFrame transformation that I came up with as a workaround, but I don't like it:
def yearDiff(end: String, start: String)(df: DataFrame): DataFrame = {
  val c = s"${end}_${start}_datediff"
  df
    .withColumn(c, datediff(col(end), col(start)))
    .withColumn("yeardiff", col(c) / 365)
}
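For comparison, the same arithmetic can probably be expressed with the Column => Column signature I am after by composing built-in functions. This is only a sketch, not necessarily the idiomatic way: the object name YearDiffSketch and the helper wholeYeardiff are made up for illustration, the division by 365 still ignores leap years, and the months_between variant is an assumed alternative that counts whole calendar years instead.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, datediff, floor, months_between}

object YearDiffSketch {
  // Fractional years: the same day count / 365 arithmetic as the
  // transformation above, so leap years are still ignored.
  def yeardiff(end: Column, start: Column): Column =
    datediff(end, start) / 365

  // Hypothetical variant: whole calendar years elapsed, based on months_between.
  def wholeYeardiff(end: Column, start: Column): Column =
    floor(months_between(end, start) / 12)

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("yeardiff-sketch")
      .getOrCreate()
    import spark.implicits._

    val testDf = Seq(
      ("2016-09-10", "2001-08-10"),
      ("2016-04-18", "2010-05-18"),
      ("2016-01-10", "2013-08-10")
    )
      .toDF("first_datetime", "second_datetime")
      .withColumn("first_datetime", $"first_datetime".cast("timestamp"))
      .withColumn("second_datetime", $"second_datetime".cast("timestamp"))

    testDf
      .withColumn("num_years", yeardiff(col("first_datetime"), col("second_datetime")))
      .withColumn("whole_years", wholeYeardiff(col("first_datetime"), col("second_datetime")))
      .show()

    spark.stop()
  }
}
```

This avoids the intermediate column and the hard-coded "yeardiff" output name, because the caller chooses the column name in withColumn.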
EDIT
I started digging into the Spark source code to see how datediff works. Here is the datediff function definition:
def datediff(end: Column, start: Column): Column = withExpr { DateDiff(end.expr, start.expr) }
Here is the DateDiff case class:
case class DateDiff(endDate: Expression, startDate: Expression)
  extends BinaryExpression with ImplicitCastInputTypes {

  override def left: Expression = endDate
  override def right: Expression = startDate
  override def inputTypes: Seq[AbstractDataType] = Seq(DateType, DateType)
  override def dataType: DataType = IntegerType

  override def nullSafeEval(end: Any, start: Any): Any = {
    end.asInstanceOf[Int] - start.asInstanceOf[Int]
  }

  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    defineCodeGen(ctx, ev, (end, start) => s"$end - $start")
  }
}
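As far as I can tell, the plain subtraction works because a DateType value is represented internally as an Int counting days since 1970-01-01, so end - start is already a day count. A minimal sketch of that arithmetic in plain Scala, using java.time rather than Spark (the object and helper names are made up for illustration):

```scala
import java.time.LocalDate

object EpochDaySketch {
  // Hypothetical helper: days since 1970-01-01, analogous to the Int
  // that backs a DateType value.
  def toEpochDays(isoDate: String): Int =
    LocalDate.parse(isoDate).toEpochDay.toInt

  // Mirrors the body of DateDiff.nullSafeEval: end - start on epoch-day Ints.
  def dayDiff(end: String, start: String): Int =
    toEpochDays(end) - toEpochDays(start)

  def main(args: Array[String]): Unit = {
    // Day difference for the first row of the sample data.
    println(dayDiff("2016-09-10", "2001-08-10"))
  }
}
```

A year difference has no equally simple form on this representation, since years vary in length, which is presumably why I can't just copy the DateDiff pattern with a different operator.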