
I have two RDDs:

rdd1 [String,String,String]: Name, Address, Zipcode
rdd2 [String,String,String]: Name, Address, Landmark 

I am trying to join these two RDDs using rdd1.join(rdd2), but I am getting an error:
error: value fullOuterJoin is not a member of org.apache.spark.rdd.RDD[String]

The join should combine the two RDDs, and the output RDD should be something like:

rddOutput : Name,Address,Zipcode,Landmark

In the end, I want to save the output as a JSON file.

Can someone help me with this?

  • join is defined on pair RDDs, so your rdd1 is not of type RDD[(String, T)]. You should map it, like this: rdd1.map(v => (v, 1)) (or to another tuple, depending on your task). If you explain your goal in more detail (what you expect to get from the join), you may get more help. Commented May 11, 2016 at 17:22
  • @VitaliyKotlyarenko: Sorry for not clarifying it earlier. I just edited the question. Can you please help me with that? Commented May 11, 2016 at 17:38
  • Your edit doesn't help much. You don't have an RDD[String], but two RDDs of [String, String, String]. Which field(s) do you want to join on? Name and Address, or just one of those? You need to change the RDDs so their entries are tuples where the first element of the pair is the key and the rest is the value; then join will work. Commented May 11, 2016 at 17:57
  • I want to join on Name and Address both. Commented May 11, 2016 at 18:45

1 Answer


As said in the comments, you have to convert your RDDs to pair RDDs before joining, which means that each RDD must be of type RDD[(key, value)]. Only then can you perform the join by key. In your case, the key is composed of (Name, Address), so you would have to do something like:

// First, we create the first PairRDD, with (name, address) as key and zipcode as value:
val pairRDD1 = rdd1.map { case (name, address, zipcode) => ((name, address), zipcode) }
// Then, we create the second PairRDD, with (name, address) as key and landmark as value:
val pairRDD2 = rdd2.map { case (name, address, landmark) => ((name, address), landmark) }

// Now we can join them.
// fullOuterJoin yields ((name, address), (Option[zipcode], Option[landmark])),
// because either side of the join may be missing a key, so we unwrap the
// Options (here with empty-string defaults) while mapping to the desired format:
val joined = pairRDD1.fullOuterJoin(pairRDD2).map {
  case ((name, address), (zipcode, landmark)) =>
    (name, address, zipcode.getOrElse(""), landmark.getOrElse(""))
}
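
To save the result as a JSON file, as the question asks, one option is to go through Spark SQL. A minimal sketch, assuming Spark 1.x with an existing SparkContext named sc (the output path is just an example):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Name the columns so the JSON keys are readable, then write
// one JSON object per line into the output directory:
joined.toDF("name", "address", "zipcode", "landmark")
  .write.json("output/joined")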

More info about PairRDD functions can be found in Spark's Scala API documentation.


6 Comments

Hello @Daniel: I built the earlier RDD using this function, to come up with the output rdd1. Is there any way I can tweak this? i.map(r => { r._1 + "|" + r._2 + "|" + r._3 + "|" + sys.extractDocument(r._3) })
Yes, make the result of your map be a (key, value) pair (see the sketch after these comments). This has been said several times, maybe you could take note of it?
@user2122466, sorry, I didn't understand your comment. Maybe you could improve your question?
@DanieldePaula .map does not result in a PairRDD but in an RDD; are you talking about .mapToPair?
@deFreitas In Scala, an RDD of pairs is implicitly converted to PairRDD, so there's no need to call mapToPair, just a normal map returning a pair. mapToPair is needed if you are using the Java API.
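Putting the comment thread together: a hypothetical sketch of the suggested tweak, assuming i holds (name, address, zipcode) tuples and sys.extractDocument is your own helper from the comment above:

// Return a ((name, address), value) pair instead of a pipe-delimited string,
// so the implicit conversion to pair-RDD functions applies and join works:
val pairRDD = i.map { r =>
  ((r._1, r._2), (r._3, sys.extractDocument(r._3)))
}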
