0

I have a table with a column 'date' and the date format is yyyyMMdd. I need to filter this dataframe and return a dataframe with only rows with dates greater than an input, For eg: Return all the rows where date is greater than "20180715". I did the following.

scala> df.groupBy("date").count.show(50,false)  
+--------+----------+                                                              
|date    |count     |  
+--------+----------+  
|20180707|200       |  
|20180715|1429586969| 
|20180628|1425490080| 
|20180716|1429819708|  
+--------+----------+ 

scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("20180715")))

scala> con.count
res4: Long = 0

scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("20170715")))

scala> con.count
res1: Long = 4284896957 

When I input the date as "20170715", it counts all the records, whereas if the date is "20180715", the filter condition does not work. What is the correct way to compare with a string date.

2
  • scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("2018-07-15"))) After changing the format of input, it worked scala> con.count res6: Long = 1429819708 Commented Jul 19, 2018 at 4:29
  • If column date is of string type, you can simply do df.filter($"date" > "20180715") since a date string with format "yyyyMMdd" would be in the suitable order for string comparison. Commented Jul 19, 2018 at 4:35

1 Answer 1

1

Changing the format of the input string passed to the lit function, solved this issue.

scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("2018-07-15"))) 

scala> con.count 
res6: Long = 1429819708
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.