
I have a CSV file which is "semi-structured":

canal,username,email,age
facebook,pepe22,[email protected],24
twitter,foo-24,[email protected],33
facebook,caty24,,22

Suppose that I want the first, second and third columns in an RDD of type org.apache.spark.rdd.RDD[(String, String, String)].

I am really new to this. I'm using Spark 1.4.1, and my code has got this far:

val rdd = sc.textFile("/user/ergorenova/socialmedia/allus/test").map(_.split(","))

Can someone help me? I would really appreciate it.


1 Answer

val rdd = sc.textFile("/user/ergorenova/socialmedia/allus/test")
  .map(_.split(",", -1) match {
    // Line with only three fields (no age at all)
    case Array(canal, username, email) => (canal, username, email)
    // Line with all four fields: the age is simply dropped
    case Array(canal, username, email, age) => (canal, username, email)
  })

You will obtain a tuple made out of the first, second and third columns.
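For illustration, collecting the RDD built from the sample file above should print tuples like the ones below (this output sketch is mine, not part of the original answer; note that a header line, if the file has one, is split and matched like any other row):

rdd.collect().foreach(println)
// (canal,username,email)                 <- header line is matched like a data row
// (facebook,pepe22,[email protected])
// (twitter,foo-24,[email protected])
// (facebook,caty24,)                     <- the empty email field is kept as ""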


7 Comments

Thank you very much, but now I have another issue: if the last element is missing, for example a line without the age, the code fails. How can I solve this?
I get an error in the second line: "not found: value age". Is there another way? Thanks for the response.
If I edit this line, "case Array(canal, username, email) => (canal, username, age)", to "case Array(canal, username, email) => (canal, username, "")", it works. Is this solution right?
Sorry, the last part should be email and not age. I mixed them up.
No problem. The issue is that now I also want the age, but one record in this file doesn't have that data, and I get an error.
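For what it's worth, here is a minimal sketch of the padding idea discussed in the comments, extended so that the age column is kept and an empty string is substituted when a line only has three fields (the 4-tuple variant and the rddWithAge name are my assumptions, not part of the accepted answer):

val rddWithAge = sc.textFile("/user/ergorenova/socialmedia/allus/test")
  .map(_.split(",", -1) match {
    // Line with all four fields: keep them all
    case Array(canal, username, email, age) => (canal, username, email, age)
    // Line missing the trailing age field: pad it with an empty string
    case Array(canal, username, email) => (canal, username, email, "")
  })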
