I would like to write a yeardiff function that works similarly to datediff. yeardiff should take two Column arguments and return a Column with the number of years between those parameter Columns.

Let's use the following sample data:

val testDf = Seq(
  ("2016-09-10", "2001-08-10"),
  ("2016-04-18", "2010-05-18"),
  ("2016-01-10", "2013-08-10")
)
  .toDF("first_datetime", "second_datetime")
  .withColumn("first_datetime", $"first_datetime".cast("timestamp"))
  .withColumn("second_datetime", $"second_datetime".cast("timestamp"))

We can run this to get the date difference:

testDf.withColumn("num_days", datediff(col("first_datetime"), col("second_datetime")))

I want to be able to run this:

testDf.withColumn("num_years", yeardiff(col("first_datetime"), col("second_datetime")))

I tried to define a yeardiff function with the necessary method signature and didn't get anywhere:

def yeardiff(end: Column, start: Column): Column = {
  // what do I do here
}    

Here is a hacked transformation solution that I came up with and don't like:

def yearDiff(end: String, start: String)(df: DataFrame): DataFrame = {
  val c = s"${end}_${start}_datediff"
  df
    .withColumn(c, datediff(col(end), col(start)))
    .withColumn("yeardiff", col(c) / 365)
}

EDIT

I started digging into the Spark source code to see how datediff works. Here is the datediff function definition:

def datediff(end: Column, start: Column): Column = withExpr { DateDiff(end.expr, start.expr) }

Here is the DateDiff case class:

case class DateDiff(endDate: Expression, startDate: Expression)
  extends BinaryExpression with ImplicitCastInputTypes {

  override def left: Expression = endDate
  override def right: Expression = startDate
  override def inputTypes: Seq[AbstractDataType] = Seq(DateType, DateType)
  override def dataType: DataType = IntegerType

  override def nullSafeEval(end: Any, start: Any): Any = {
    end.asInstanceOf[Int] - start.asInstanceOf[Int]
  }

  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    defineCodeGen(ctx, ev, (end, start) => s"$end - $start")
  }
}

2 Answers

This may solve your problem:

def yearDiff(end: Column, start: Column): Column = {
  datediff(end, start)/365
}
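
One caveat with dividing by 365: the result is a fractional Double, and because leap years add extra days, rounding it down can overshoot the number of completed years. The following is a minimal plain-Scala sketch (no Spark) that compares the two calculations with java.time. The object and method names are made up for illustration.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object YearDiffCheck {
  // Approximation: elapsed days divided by a fixed 365-day year.
  def approxYears(start: LocalDate, end: LocalDate): Double =
    ChronoUnit.DAYS.between(start, end) / 365.0

  // Calendar-aware count of completed years, for comparison.
  def wholeYears(start: LocalDate, end: LocalDate): Long =
    ChronoUnit.YEARS.between(start, end)
}
```

For 2000-01-01 to 2019-12-28, approxYears is just above 20 (7301 days / 365), while wholeYears is 19, because the span contains five leap days. If only an approximate fractional value is needed, the division is fine.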

1 Comment

This is a good work-around, but you need to change the argument order to datediff(end, start)/365.

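We can use the built-in year function and a UDF that subtracts one year when the end date falls before the anniversary of the start date (that is, the anniversary month and day have not yet passed).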

def yeardiff(end: Column, start: Column): Column = {
  // Returns -1 when the end date falls before the anniversary of the start date
  // within its year, so that only completed years are counted; otherwise 0.
  def getAdjustment(monthStart: Int, monthEnd: Int, dayStart: Int, dayEnd: Int): Int = {
    if (monthEnd < monthStart) -1
    else if (monthStart == monthEnd && dayEnd < dayStart) -1
    else 0
  }
  val udfGetAdjustment = udf[Int, Int, Int, Int, Int](getAdjustment)
  val adj = udfGetAdjustment(month(start), month(end), dayofmonth(start), dayofmonth(end))
  year(end) - year(start) + adj
}
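
A UDF-free variant is also possible. The sketch below is not from the answer above; it assumes the built-in months_between and floor functions from org.apache.spark.sql.functions, and the wrapping object name is made up for illustration. It counts elapsed months, divides by 12, and rounds down.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{floor, months_between}

object YearDiff {
  // Completed years between two date/timestamp columns.
  // months_between returns a (possibly fractional) number of months;
  // floor rounds down, and the cast turns the bigint result into an int.
  def yeardiff(end: Column, start: Column): Column =
    floor(months_between(end, start) / 12).cast("int")
}
```

With the sample data from the question, testDf.withColumn("num_years", YearDiff.yeardiff(col("first_datetime"), col("second_datetime"))) should give 15, 5 and 2. Note that floor rounds toward negative infinity, so if end is earlier than start the result is rounded away from zero; whether that is the desired behavior depends on the use case.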
