I want to check whether the date in a column appears in a given list of dates. If it does, I need to flag that in a new column.

dataset = dataset.withColumn("incoming_timestamp", col("incoming_timestamp").cast("timestamp"))
                .withColumn("incoming_date", to_date(col("incoming_timestamp")));

My incoming_timestamp is 2021-03-30 00:00:00; after converting it to a date it is 2021-03-30.

The output dataset looks like this:

+----------------------+-------------------+----------------------------------------+
|col 1                 |incoming_timestamp | incoming_date                          |
+----------------------+-------------------+----------------------------------------+
|val1                  |2021-03-30 00:00:00| 2021-07-06                             |
|val2                  |2020-03-30 00:00:00| 2020-03-30                             |
|val3                  |1889-03-30 00:00:00| 1889-03-30                             |
+----------------------+-------------------+----------------------------------------+

I have a String declared like this:

String Dates = "2021-07-06,1889-03-30";

I want to add one more column to the result dataset that indicates whether the incoming date is present in the Dates String.

Like this:

+----------------------+-------------------+----------------------------------------+--------------+
|col 1                 |incoming_timestamp | incoming_date                          |      result  |
+----------------------+-------------------+----------------------------------------+--------------+
|val1                  |2021-03-30 00:00:00| 2021-07-06                             |  true        |
|val2                  |2020-03-30 00:00:00| 2020-03-30                             |  false       |
|val3                  |1889-03-30 00:00:00| 1889-03-30                             |  true        |
+----------------------+-------------------+----------------------------------------+--------------+

For that, I first need to convert this String into an array and then use array_contains(column, value), which returns true if the array contains the value.

I tried the following:

METHOD 1

DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
Date[] dateArr = Arrays.stream(Dates.split(","))
        .map(d -> LocalDate.parse(d, formatter))
        .toArray(Date[]::new);

This throws java.lang.ArrayStoreException: java.time.LocalDate

METHOD 2

SimpleDateFormat formatter = new SimpleDateFormat("YYYY-MM-DD", Locale.ENGLISH);
formatter.setTimeZone(TimeZone.getTimeZone("America/New_York"));

Date[] dateArr = Arrays.stream(Dates.split(",")).map(d -> {
    try {
        return formatter.parse(d);
    } catch (ParseException e) {
        e.printStackTrace();
    }
    return null;
}).toArray(Date[]::new);

dataset = dataset.withColumn("result", array_contains(col("incoming_date"), dateArr));

This throws the following error:

org.apache.spark.sql.AnalysisException: Unsupported component type class java.util.Date in arrays

Can anyone help with this?

1 Answer

This can be solved by converting each date String to java.sql.Date and then testing membership with isin.

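    import java.sql.Date
    import org.apache.spark.sql.functions.{col, to_date}

    // Sample rows: a label and a timestamp stored as a string.
    val data: Seq[(String, String)] = Seq(
      ("val1", "2020-07-31 00:00:00"),
      ("val2", "2021-02-28 00:00:00"),
      ("val3", "2019-12-31 00:00:00"))

    // Split the comma-separated string and convert each trimmed piece to java.sql.Date.
    val compareDate = "2020-07-31, 2019-12-31"
    val compareDateArray = compareDate.split(",").map(x => Date.valueOf(x.trim))

    // Assumes an existing SparkSession named `spark`.
    import spark.implicits._
    // Parse the string column into a date column.
    val df = data.toDF("variable", "date")
      .withColumn("date_casted", to_date(col("date"), "y-M-d H:m:s"))
    df.show()
    df.printSchema()

    // isin takes varargs, so the array is expanded with `: _*`.
    val outputDf = df.withColumn("result", col("date_casted").isin(compareDateArray: _*))
    outputDf.show()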

Input:

+--------+-------------------+-----------+
|variable|               date|date_casted|
+--------+-------------------+-----------+
|    val1|2020-07-31 00:00:00| 2020-07-31|
|    val2|2021-02-28 00:00:00| 2021-02-28|
|    val3|2019-12-31 00:00:00| 2019-12-31|
+--------+-------------------+-----------+

root
 |-- variable: string (nullable = true)
 |-- date: string (nullable = true)
 |-- date_casted: date (nullable = true)

Output:

+--------+-------------------+-----------+------+
|variable|               date|date_casted|result|
+--------+-------------------+-----------+------+
|    val1|2020-07-31 00:00:00| 2020-07-31|  true|
|    val2|2021-02-28 00:00:00| 2021-02-28| false|
|    val3|2019-12-31 00:00:00| 2019-12-31|  true|
+--------+-------------------+-----------+------+
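
Since the question uses Java, here is a minimal sketch of the same String-to-java.sql.Date conversion in Java. The class name DateListParser and the method parseDates are hypothetical names chosen for illustration. It also suggests why the question's METHOD 1 failed: the stream there held java.time.LocalDate objects, which cannot be stored in a Date[], so each LocalDate has to be converted with Date.valueOf first.

```java
import java.sql.Date;
import java.time.LocalDate;
import java.util.Arrays;

// Hypothetical helper, not part of Spark or of the original answer.
final class DateListParser {

    // Parses a comma-separated list of ISO dates (yyyy-MM-dd) into java.sql.Date values.
    static Date[] parseDates(String commaSeparated) {
        return Arrays.stream(commaSeparated.split(","))
                // LocalDate.parse uses the ISO yyyy-MM-dd format by default;
                // Date.valueOf turns the LocalDate into a java.sql.Date.
                .map(text -> Date.valueOf(LocalDate.parse(text.trim())))
                .toArray(Date[]::new);
    }

    public static void main(String[] args) {
        Date[] dateArr = parseDates("2021-07-06,1889-03-30");
        System.out.println(Arrays.toString(dateArr));
    }
}
```

Assuming the isin approach from the Scala code above carries over to the Java API, the array could then be used as dataset.withColumn("result", col("incoming_date").isin((Object[]) dateArr)). The cast to Object[] makes it explicit that the array elements are passed as the varargs values.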