The input data looks like this:

+--------------------+-------------+--------------------+
|           StudentID|       Right |             Wrong  |
+--------------------+-------------+--------------------+
|       studentNo01  |       a,b,c |            x,y,z   |
+--------------------+-------------+--------------------+
|       studentNo02  |         c,d |              v,w   |
+--------------------+-------------+--------------------+

The desired output looks like this:

+--------------------+---------+
|           key      |    value|
+--------------------+---------+
|     studentNo01,a  |       1 |
+--------------------+---------+
|     studentNo01,b  |       1 |
+--------------------+---------+
|     studentNo01,c  |       1 | 
+--------------------+---------+
|     studentNo01,x  |       0 | 
+--------------------+---------+
|     studentNo01,y  |       0 | 
+--------------------+---------+
|     studentNo01,z  |       0 | 
+--------------------+---------+
|     studentNo02,c  |       1 | 
+--------------------+---------+
|     studentNo02,d  |       1 | 
+--------------------+---------+
|     studentNo02,v  |       0 | 
+--------------------+---------+
|     studentNo02,w  |       0 | 
+--------------------+---------+

Each item in Right gets the value 1, and each item in Wrong gets the value 0.

I want to process this data using Spark's map function or a UDF, but I don't know how to approach it. Can you help me, please? Thank you.

  • Do you want a DataFrame as input and an RDD as output? Commented Oct 13, 2016 at 11:55

1 Answer

Use split and explode on each of the two columns, tag the Right rows with 1 and the Wrong rows with 0, then union the two results:

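// Assumes a Spark 2.x session named `spark`; on Spark 1.x use sqlContext.implicits._
import spark.implicits._
import org.apache.spark.sql.functions.{concat_ws, explode, lit, split}

val df = List(
  ("studentNo01","a,b,c","x,y,z"),
  ("studentNo02","c,d","v,w")
  ).toDF("StudentID","Right","Wrong")

+-----------+-----+-----+
|  StudentID|Right|Wrong|
+-----------+-----+-----+
|studentNo01|a,b,c|x,y,z|
|studentNo02|  c,d|  v,w|
+-----------+-----+-----+


// explode names its output column "col" by default
val pair = (
  df.select('StudentID,explode(split('Right,",")))
    .select(concat_ws(",",'StudentID,'col).as("key"))
    .withColumn("value",lit(1))   // items from Right get 1
).unionAll(
  df.select('StudentID,explode(split('Wrong,",")))
    .select(concat_ws(",",'StudentID,'col).as("key"))
    .withColumn("value",lit(0))   // items from Wrong get 0
)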


+-------------+-----+
|          key|value|
+-------------+-----+
|studentNo01,a|    1|
|studentNo01,b|    1|
|studentNo01,c|    1|
|studentNo02,c|    1|
|studentNo02,d|    1|
|studentNo01,x|    0|
|studentNo01,y|    0|
|studentNo01,z|    0|
|studentNo02,v|    0|
|studentNo02,w|    0|
+-------------+-----+

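You can convert the result to an RDD as follows (going through .rdd first, since on Spark 2.x DataFrame.map returns a Dataset rather than an RDD):

val rdd = pair.rdd.map(r => (r.getString(0),r.getInt(1)))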
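
Since the question asks about a map-style approach, here is a minimal sketch of the same per-row expansion written as a plain function. The helper name `expand` is hypothetical and not part of the answer above, and the trimming and empty-item filtering are assumptions about how stray whitespace or empty cells should be handled. It runs on ordinary Scala collections so the logic can be checked without a cluster.

```scala
// Hypothetical helper: turns one input row into (key, value) pairs,
// 1 for every item in `right`, 0 for every item in `wrong`.
def expand(id: String, right: String, wrong: String): Seq[(String, Int)] = {
  // Split on commas; trimming and dropping empty items is an assumption.
  def items(s: String): Seq[String] =
    s.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  items(right).map(a => (s"$id,$a", 1)) ++
    items(wrong).map(a => (s"$id,$a", 0))
}

// Same sample data as in the question, as a local collection.
val rows = Seq(
  ("studentNo01", "a,b,c", "x,y,z"),
  ("studentNo02", "c,d", "v,w")
)

// flatMap flattens the per-row sequences into one list of pairs.
val pairs = rows.flatMap { case (id, r, w) => expand(id, r, w) }

pairs.foreach(println)
```

On the DataFrame from the answer, the same function could presumably be applied with something like df.rdd.flatMap(r => expand(r.getString(0), r.getString(1), r.getString(2))), which yields an RDD of (String, Int) pairs directly. The explode version keeps everything in the DataFrame API, which is usually preferable when the result stays a DataFrame.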