
I am brand new to Spark (hours) and rather inexperienced with Scala, but I have a long-standing desire to become more familiar with both.

I have a rather trivial task. I have two DataFrames that I am importing from two JSON files: one with uuid, text, tag_ids, and the other with the tags' id, term. I would like to produce a new JSON file that I can import into Solr containing uuid, text, tag_ids, tag_terms.

val text = spark.read.json("/tmp/text.js")
val tags = spark.read.json("/tmp/tags.js")



text.printSchema()

root
 |-- uuid: string (nullable = true)
 |-- tag_ids: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- text: string (nullable = true)

tags.printSchema()

root
 |-- id: string (nullable = true)
 |-- term: string (nullable = true)
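
For reference, DataFrames matching these schemas can also be mocked up in memory, which makes it easy to test a join locally (hypothetical sample rows chosen to line up with the desired output below; sampleText and sampleTags are illustrative names, not the original files):

import spark.implicits._

// Hypothetical rows matching the schemas above
val sampleText = Seq(
  ("uuid-1", Seq("1", "2"), "foo"),
  ("uuid-2", Seq("2", "3"), "bar")
).toDF("uuid", "tag_ids", "text")

val sampleTags = Seq(
  ("1", "tag1"),
  ("2", "tag2"),
  ("3", "tag3")
).toDF("id", "term")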


# desired output
+--------------------+------+---------+------------+
|                uuid| text | tag_ids |   tag_terms|
+--------------------+------+---------+------------+
|cf5c1f4c-96e6-4ca...| foo  |    [1,2]| [tag1,tag2]|
|c9834e2e-0f04-486...| bar  |    [2,3]| [tag2,tag3]|
+--------------------+------+---------+------------+

It is difficult to show everything I have been trying. Essentially, .join() has issues with tag_ids being an array. I can explode() tag_ids and join against the tags to pick up the terms, but reassembling the result into a new DataFrame to export is still beyond my level.

2 Answers


Solution using explode:

import org.apache.spark.sql.functions.{explode, first, collect_list}
import spark.implicits._

val result = text
  .withColumn("tag_id", explode($"tag_ids"))    // one row per (uuid, tag_id)
  .join(tags, $"tag_id" === $"id")              // look up each tag's term
  .groupBy("uuid", "tag_ids")                   // collapse back to one row per document
  .agg(first("text") as "text", collect_list("term") as "tag_terms")

1 Comment

Thanks, this worked. In my actual data I had a few more columns, and some rows had null values or no tag_ids at all; that was easily solved with a left join.
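
In case it helps anyone else, the left-join variant mentioned here might look roughly like this (a sketch, not from the answer above; explode_outer needs Spark 2.2+):

import org.apache.spark.sql.functions.{explode_outer, first, collect_list}
import spark.implicits._

val resultAll = text
  .withColumn("tag_id", explode_outer($"tag_ids"))   // keeps rows whose tag_ids is null or empty
  .join(tags, $"tag_id" === $"id", "left")           // left join so unmatched ids don't drop rows
  .groupBy("uuid", "tag_ids")
  .agg(first("text") as "text", collect_list("term") as "tag_terms")

Since collect_list skips nulls, rows without tags come through with an empty tag_terms array.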

Try this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read the two JSON files directly into DataFrames
val text = spark.read.json("/tmp/text.js")
val tags = spark.read.json("/tmp/tags.js")

// Register them as temporary views so they can be joined with SQL
text.createOrReplaceTempView("A")
tags.createOrReplaceTempView("B")

// Explode tag_ids with LATERAL VIEW, join each id to its term,
// then collect the terms back into an array per uuid
spark.sql("""
  SELECT d1.uuid, first(d1.text) AS text, d1.tag_ids, collect_list(d2.term) AS tag_terms
  FROM A d1
  LATERAL VIEW explode(d1.tag_ids) t AS tag_id
  JOIN B d2 ON t.tag_id = d2.id
  GROUP BY d1.uuid, d1.tag_ids
""").show()
