How to explode an array[string] field and group the data in one pass

Question

I'm new to Scala and Spark and don't know how to explode the "path" field and find the max and min of the "event_dttm" field in one pass. I have this data:

val weblog = sc.parallelize(Seq(
  ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424954530, "SO.01"),
  ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424964830, "SO.01"),
  ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424978445, "SO.01")
)).toDF("id", "domain", "path", "event_dttm", "load_src")

I need to get the following result:

id           | domain         | newPath | max_time   | min_time   | load_src
39f0412b4c91 | staticnavi.com | panel   | 1424978445 | 1424954530 | SO.01
39f0412b4c91 | staticnavi.com | cm.html | 1424978445 | 1424954530 | SO.01

I think this can be done with a row function, but I don't know how.

mtoto · Accepted Answer · 2017-04-05 18:17:44Z


You are looking for explode(), followed by a groupBy aggregation:

import org.apache.spark.sql.functions.{explode, min, max}

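// explode emits one row per element of the "path" array
val result = weblog.withColumn("path", explode($"path"))
  // then aggregate once per (id, domain, path, load_src) group
  .groupBy("id", "domain", "path", "load_src")
  .agg(min($"event_dttm").as("min_time"),
       max($"event_dttm").as("max_time"))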

result.show()
+------------+--------------+-------+--------+----------+----------+
|          id|        domain|   path|load_src|  min_time|  max_time|
+------------+--------------+-------+--------+----------+----------+
|39f0412b4c91|staticnavi.com|  panel|   SO.01|1424954530|1424978445|
|39f0412b4c91|staticnavi.com|cm.html|   SO.01|1424954530|1424978445|
+------------+--------------+-------+--------+----------+----------+
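To see what explode followed by groupBy computes, here is a minimal sketch of the same logic on plain Scala collections, with no Spark involved. The `Hit` case class and the names `hits`, `exploded`, and `grouped` are hypothetical stand-ins introduced only for this illustration; they are not part of the answer above.

```scala
// Hypothetical stand-in for one row of the weblog DataFrame.
final case class Hit(id: String, domain: String, path: Seq[String], eventDttm: Int, loadSrc: String)

val hits = Seq(
  Hit("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424954530, "SO.01"),
  Hit("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424964830, "SO.01"),
  Hit("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424978445, "SO.01")
)

// "explode": emit one record per element of the path sequence.
val exploded = hits.flatMap { h =>
  h.path.map(p => (h.id, h.domain, p, h.loadSrc, h.eventDttm))
}

// "groupBy + agg": group on the four key columns, then take min and max of the timestamp.
val grouped = exploded
  .groupBy { case (id, domain, p, src, _) => (id, domain, p, src) }
  .map { case (key, rows) =>
    val times = rows.map(_._5)
    key -> (times.min, times.max)
  }

grouped.foreach(println)
```

Each key in `grouped` corresponds to one output row of the DataFrame version, and the value pair holds (min_time, max_time).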


3 Comments

Fred Over a year ago

Thanks! Works fine. Is there another way w/o using explode?

mtoto Over a year ago

Using the RDD API, but that's going to be more elaborate and potentially slower.

Fred Over a year ago

I've found a solution with flatMap:

val result = weblog.flatMap {
  case Row(id: String, domain: String, path: String, event_dttm: Long, load_src: String, ymd: String) =>
    path.split("/").map(x => (id, domain.concat("#").concat(x), BigInt(event_dttm), load_src, ymd))
}
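For the RDD route mentioned in the comments, one possible sketch against the question's original schema (where "path" is an array of strings and "event_dttm" is an Int) is shown below. It assumes Spark is on the classpath and that the code runs in spark-shell or as a script; the local SparkSession setup and the names `keyed` and `viaRdd` are assumptions made for this example, not something from the thread.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("explode-via-rdd").getOrCreate()
import spark.implicits._

val weblog = Seq(
  ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424954530, "SO.01"),
  ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424964830, "SO.01"),
  ("39f0412b4c91", "staticnavi.com", Seq("panel", "cm.html"), 1424978445, "SO.01")
).toDF("id", "domain", "path", "event_dttm", "load_src")

// flatMap plays the role of explode: one keyed record per path element.
// The value starts as (ts, ts) so it can be folded into a (min, max) pair.
val keyed: RDD[((String, String, String, String), (Int, Int))] =
  weblog.rdd.flatMap { row =>
    val id      = row.getAs[String]("id")
    val domain  = row.getAs[String]("domain")
    val paths   = row.getAs[scala.collection.Seq[String]]("path")
    val ts      = row.getAs[Int]("event_dttm")
    val loadSrc = row.getAs[String]("load_src")
    paths.map(p => ((id, domain, p, loadSrc), (ts, ts)))
  }

// reduceByKey merges the (min, max) pairs of each key.
val viaRdd = keyed.reduceByKey { case ((min1, max1), (min2, max2)) =>
  (math.min(min1, min2), math.max(max1, max2))
}

viaRdd.collect().foreach(println)
```

This keeps the work to a single flatMap and a single reduceByKey, but it gives up the DataFrame optimizer, which is consistent with the remark above that the RDD version may be slower.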
