I have reframed the question with more details.
I have a DataFrame named "dailyshow". Its schema is:
scala> dailyshow.printSchema
root
|-- year: integer (nullable = true)
|-- occupation: string (nullable = true)
|-- showdate: string (nullable = true)
|-- group: string (nullable = true)
|-- guest: string (nullable = true)
Sample Data is:
scala> dailyshow.show(5)
+----+------------------+---------+------+----------------+
|year| occupation| showdate| group| guest|
+----+------------------+---------+------+----------------+
|1999| actor|1/11/1999|Acting| Michael J. Fox|
|1999| Comedian|1/12/1999|Comedy| Sandra Bernhard|
|1999|television actress|1/13/1999|Acting| Tracey Ullman|
|1999| film actress|1/14/1999|Acting|Gillian Anderson|
|1999| actor|1/18/1999|Acting|David Alan Grier|
+----+------------------+---------+------+----------------+
The code below transforms the data and returns the top 5 occupations for shows between "01/11/1999" and "06/11/1999" (both dates inclusive):
scala> dailyshow.
withColumn("showdate",to_date(unix_timestamp(col("showdate"),"MM/dd/yyyy").
cast("timestamp"))).
where((col("showdate") >= "1999-01-11") and (col("showdate") <= "1999-06-11")).
groupBy(col("occupation")).agg(count("*").alias("count")).
orderBy(desc("count")).
limit(5).show
+------------------+-----+
| occupation|count|
+------------------+-----+
| actor| 29|
| actress| 20|
| comedian| 4|
|television actress| 3|
| stand-up comedian| 2|
+------------------+-----+
My question is: how can I write code that produces the same result using an RDD?
scala> dailyshow.first
res12: org.apache.spark.sql.Row = [1999,actor,1/11/1999,Acting,Michael J. Fox]
So far I have used SimpleDateFormat to parse the showdate string of each row into a date.
Below is the code:
val format = new java.text.SimpleDateFormat("MM/dd/yyyy")
dailyshow.
map(x => x.mkString(",")).
map(x => x.split(",")).
map(x => format.parse(x(2))).first // returns Mon Jan 11 00:00:00 PST 1999
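To make the goal concrete, here is a rough sketch of the direction I imagine for the RDD version. It is only an illustration under assumptions: the helper name topOccupations and its parameters are hypothetical, it reads the columns by position (1 = occupation, 2 = showdate, as in the schema above), and it silently drops rows whose showdate is null or cannot be parsed, which is my guess at matching how the DataFrame filter treats them.

```scala
import java.text.SimpleDateFormat
import scala.util.Try
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Hypothetical helper: count occupations for shows whose date lies in
// [from, to] (inclusive) and return the n most frequent ones.
def topOccupations(rows: RDD[Row], from: String, to: String, n: Int): Array[(String, Long)] = {
  val pattern = "MM/dd/yyyy"
  // Parse the bounds once on the driver; java.util.Date is serializable.
  val start = new SimpleDateFormat(pattern).parse(from)
  val end   = new SimpleDateFormat(pattern).parse(to)

  rows
    .mapPartitions { it =>
      // SimpleDateFormat is not thread-safe, so create one per partition.
      val fmt = new SimpleDateFormat(pattern)
      it.flatMap { r =>
        // Assumption: rows with a null or unparseable showdate are skipped.
        Try(fmt.parse(r.getString(2))).toOption.map(d => (r.getString(1), d))
      }
    }
    .filter { case (_, d) => !d.before(start) && !d.after(end) }
    .map { case (occupation, _) => (occupation, 1L) }
    .reduceByKey(_ + _)                // count per occupation
    .sortBy(_._2, ascending = false)   // largest count first
    .take(n)
}
```

It would be called as topOccupations(dailyshow.rdd, "01/11/1999", "06/11/1999", 5).foreach(println). Reading the fields with getString avoids the mkString/split round trip, which would also break if a field ever contained a comma.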
What does dailyshow contain? And your first two maps look like they cancel each other.
Sample datashow and expected output will be of great help to answerers. Please add if you can, thanks.