5

Spark Version: spark-2.0.1-bin-hadoop2.7 Scala: 2.11.8

I am loading a raw csv into a DataFrame. In csv, although the column is support to be in date format, they are written as 20161025 instead of 2016-10-25. The parameter date_format includes string of column names that need to be converted to yyyy-mm-dd format.

In the following code, I first loaded the csv of Date column as StringType via the schema, and then I check if the date_format is not empty, that is there are columns that need to be converted to Date from String, then cast each column using unix_timestamp and to_date. However, in the csv_df.show(), the returned rows are all null.

def read_csv(csv_source:String, delimiter:String, is_first_line_header:Boolean, 
    schema:StructType, date_format:List[String]): DataFrame = {
    println("|||| Reading CSV Input ||||")

    var csv_df = sqlContext.read
        .format("com.databricks.spark.csv")
        .schema(schema)
        .option("header", is_first_line_header)
        .option("delimiter", delimiter)
        .load(csv_source)
    println("|||| Successfully read CSV. Number of rows -> " + csv_df.count() + " ||||")
    if(date_format.length > 0) {
        for (i <- 0 until date_format.length) {
            csv_df = csv_df.select(to_date(unix_timestamp(
                csv_df(date_format(i)), "yyyy-­MM-­dd").cast("timestamp")))
            csv_df.show()
        }
    }
    csv_df
}

Returned Top 20 rows:

+-------------------------------------------------------------------------+
|to_date(CAST(unix_timestamp(prom_price_date, YYYY-­MM-­DD) AS TIMESTAMP))|
+-------------------------------------------------------------------------+
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
|                                                                     null|
+-------------------------------------------------------------------------+

Why am I getting all null?

1 Answer 1

11

To convert yyyyMMdd to yyyy-MM-dd you can:

spark.sql("""SELECT DATE_FORMAT(
  CAST(UNIX_TIMESTAMP('20161025', 'yyyyMMdd') AS TIMESTAMP), 'yyyy-MM-dd'
)""")

with functions:

date_format(unix_timestamp(col, "yyyyMMdd").cast("timestamp"), "yyyy-MM-dd")
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.