
I am new to both Spark and Scala, and I'm trying to practice the join command in Spark.

I have two csv files:

Ads.csv is

5de3ae82-d56a-4f70-8738-7e787172c018,AdProvider1
f1b6c6f4-8221-443d-812e-de857b77b2f4,AdProvider2
aca88cd0-fe50-40eb-8bda-81965b377827,AdProvider1
940c138a-88d3-4248-911a-7dbe6a074d9f,AdProvider3
983bb5e5-6d5b-4489-85b3-00e1d62f6a3a,AdProvider3
00832901-21a6-4888-b06b-1f43b9d1acac,AdProvider1
9a1786e1-ab21-43e3-b4b2-4193f572acbc,AdProvider1
50a78218-d65a-4574-90de-0c46affbe7f3,AdProvider5
d9bb837f-c85d-45d4-95f2-97164c62aa42,AdProvider4
611cf585-a8cf-43e9-9914-c9d1dc30dab5,AdProvider1

Impression.csv is:

5de3ae82-d56a-4f70-8738-7e787172c018,Publisher1
f1b6c6f4-8221-443d-812e-de857b77b2f4,Publisher2
aca88cd0-fe50-40eb-8bda-81965b377827,Publisher1
940c138a-88d3-4248-911a-7dbe6a074d9f,Publisher3
983bb5e5-6d5b-4489-85b3-00e1d62f6a3a,Publisher3
00832901-21a6-4888-b06b-1f43b9d1acac,Publisher1
9a1786e1-ab21-43e3-b4b2-4193f572acbc,Publisher1
611cf585-a8cf-43e9-9914-c9d1dc30dab5,Publisher1   

I want to join them, using the first field (the ID) as the key and pairing up the two values.

So I read them in like this:

val ads = sc.textFile("ads.csv")
ads: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
val impressions = sc.textFile("impressions.csv")
impressions: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21

Ok, so I have to make key,value pairs:

val adPairs = ads.map(line => line.split(","))
val impressionPairs = impressions.map(line => line.split(","))

res11: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[6] at map at <console>:23
res13: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[7] at map at <console>:23

But I can't join them:

val result = impressionPairs.join(adPairs)
<console>:29: error: value join is not a member of org.apache.spark.rdd.RDD[Array[String]]
val result = impressionPairs.join(adPairs)

Do I need to convert the pairs into another format?

1 Answer

You are almost there; you just need to transform each Array[String] into a key-value tuple, like this:

val adPairs = ads.map(line => {
  val substrings = line.split(",")
  (substrings(0), substrings(1))
})

(and the same for impressionPairs)

That will give you RDDs of type RDD[(String, String)], which can then be joined :)
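For intuition, join on pair RDDs matches the two datasets by key and pairs up the values, keeping only keys present on both sides. Here is a plain-Scala sketch of those semantics (no Spark needed to run it; the sample rows are taken from the files above, and the flatMap over a Map stands in for what join computes):

```scala
// Small samples of the two CSV files, already split into (key, value) pairs.
val adPairs = Seq(
  ("5de3ae82-d56a-4f70-8738-7e787172c018", "AdProvider1"),
  ("f1b6c6f4-8221-443d-812e-de857b77b2f4", "AdProvider2"),
  ("50a78218-d65a-4574-90de-0c46affbe7f3", "AdProvider5") // no matching impression
)
val impressionPairs = Seq(
  ("5de3ae82-d56a-4f70-8738-7e787172c018", "Publisher1"),
  ("f1b6c6f4-8221-443d-812e-de857b77b2f4", "Publisher2")
)

// Inner join by key, mirroring RDD.join: each match yields (key, (left, right)),
// and keys missing from either side are dropped.
val adsByKey = adPairs.toMap
val joined = impressionPairs.flatMap { case (id, publisher) =>
  adsByKey.get(id).map(provider => (id, (publisher, provider)))
}
// joined contains the two matched keys; AdProvider5's row is dropped.
```

With the real RDDs, impressionPairs.join(adPairs) produces the analogous RDD[(String, (String, String))].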


3 Comments

I get the error <console>:45: error: value split is not a member of (org.apache.spark.graphx.VertexId, String) val substrings = line.split(",")
My two arrays are in the form mVP: Array[(org.apache.spark.graphx.VertexId, Int)], cPV: Array[(org.apache.spark.graphx.VertexId, String)] and I want to join mVP and cPV by VertexId
The split operator works on type String - not on type (org.apache.spark.graphx.VertexId, String). But if you want to play with Spark, you should probably start by converting your arrays into RDDs with sc.parallelize
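To illustrate the last comment: the commenter's arrays are already (key, value) tuples, so no split is needed, only a join by key. A minimal plain-Scala sketch with hypothetical sample data shaped like those arrays (in GraphX, VertexId is just a Long; with Spark you would instead call sc.parallelize(mVP).join(sc.parallelize(cPV))):

```scala
// Hypothetical data in the shape of the commenter's arrays.
val mVP: Array[(Long, Int)] = Array((1L, 10), (2L, 20))
val cPV: Array[(Long, String)] = Array((1L, "a"), (3L, "c"))

// Plain-Scala inner join by VertexId, mirroring what the pair-RDD
// join would compute after sc.parallelize on both arrays.
val cByKey = cPV.toMap
val joined = mVP.flatMap { case (id, n) =>
  cByKey.get(id).map(s => (id, (n, s)))
}
// Only keys present in both arrays survive: Array((1L, (10, "a")))
```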
