4

Suppose there is a dateframe with a column comprised of dates as strings. For that assumption, we create the following dataFrame as an example:

# Importing sql types
from pyspark.sql.types import StringType, IntegerType, StructType, StructField, DoubleType, FloatType, DateType
from pyspark.sql.functions  import date_format
import random
import time

def strTimeProp(start, end, format, prop):
    stime = time.mktime(time.strptime(start, format)) # Parse a string representing a time according to a format
    etime = time.mktime(time.strptime(end, format)) 
    ptime = stime + prop * (etime - stime) 
    return time.strftime(format, time.localtime(ptime))

def randomDate(start, end, prop):
    return strTimeProp(start, end, '%m-%d-%Y', prop)

# Creación de un dataframe de prueba:
schema = StructType(
        [
     StructField("dates1", StringType(), True),
         StructField("dates2",  StringType(), True)
    ]
)

size = 32
numCol1 = [str(randomDate("1-1-1991", "1-1-1992", random.random())) for number in range(size)]
numCol2 = [str(randomDate("1-1-1991", "1-1-1992", random.random())) for number in range(size)]
# Building dataFrame:
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(list(zip(numCol1, numCol2)),schema=schema)
df.show(5)

In the code above, a random date column is generated, here is an example:

+----------+----------+
|    dates1|    dates2|
+----------+----------+
|12-21-1991|05-30-1991|
|05-28-1991|01-23-1991|
|03-01-1991|08-05-1991|
|07-15-1991|05-13-1991|
|07-21-1991|11-10-1991|
+----------+----------+

What I am trying to do is to change date format with the following code (from pySpark documentation):

# Changing date formats:
df.select(date_format('dates1', 'MM-dd-yyy').alias('newFormat')).show(5)

But I get this bad result:

+---------+
|newFormat|
+---------+
|     null|
|     null|
|     null|
|     null|
|     null|
+---------+

I suppose there is a problem relate with the string dataType but at same time, I don't understand why this code bellow works and the code above don't.

fechas = ['1000-01-01', '1000-01-15']
df = sqlContext.createDataFrame(list(zip(fechas, fechas)), ['dates', 'd'])
df.show()

# Changing date formats:
df.select(date_format('dates', 'MM-dd-yyy').alias('newFormat')).show()

Output:

+----------+----------+
|     dates|         d|
+----------+----------+
|1000-01-01|1000-01-01|
|1000-01-15|1000-01-15|
+----------+----------+

+----------+
| newFormat|
+----------+
|01-01-1000|
|01-15-1000|
+----------+

This last results is what I want.

1
  • should it be MM-dd-yyyy other than MM-dd-yyy? Commented Mar 18, 2016 at 22:38

2 Answers 2

12

It doesn't work because your data is not a valid ISO 8601 representation and cast to date returns NULL:

sqlContext.sql("SELECT CAST('12-21-1991' AS DATE)").show()
## +----+
## | _c0|
## +----+
## |null|
## +----+

You'll have to parse data first using custom format:

output_format = ...  # Some SimpleDateFormat string
df.select(date_format(
    unix_timestamp("dates1", "MM-dd-yyyy").cast("timestamp"), 
    output_format
))
Sign up to request clarification or add additional context in comments.

Comments

0

You can solve it by first casting your date from string to timestamp with the to_timestamp function available on pyspark.sql (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.to_timestamp.html)

# Changing date formats:
from pyspark.sql.functions import to_timestamp, date_format
df.select(date_format(to_timestamp('dates1', "MM-dd-yyyy"), "MM-dd-yyy").alias('newFormat')).show(5)


+----------+
| newFormat|
+----------+
|08-11-1991|
|11-11-1991|
|12-04-1991|
|03-30-1991|
|04-22-1991|
+----------+

You can even get, for example, only the month and day:

df.select(date_format(to_timestamp('dates1', "MM-dd-yyyy"), "MM-dd").alias('newFormat')).show(5)

 +---------+
|newFormat|
+---------+
|    08-11|
|    11-11|
|    12-04|
|    03-30|
|    04-22|
+---------+

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.