test.csv
name,key1,key2
A,1,2
B,1,3
C,4,3

I want to reshape this data like the following (as a Dataset or RDD):

whatIwant.csv
name,key,newkeyname
A,1,KEYA
A,2,KEYB
B,1,KEYA
B,3,KEYB
C,4,KEYA
C,3,KEYB

I loaded the data with the read method:

val df = spark.read
            .option("header", true)
            .option("charset", "euc-kr")
            .csv(csvFilePath)

I could load each pair, (name, key1) and (name, key2), as a separate dataset and combine them with union, but I want to do this in a single Spark session. Any ideas?


These attempts did not work:

val df2 = df.select( df("TAG_NO"), df.map { x => (x.getAs[String]("MK_VNDRNM"), x.getAs[String]("WK_ORD_DT")) })

val df2 = df.select( df("TAG_NO"), Seq(df("TAG_NO"), df("WK_ORD_DT")))
  • Did you try the explode function from DataFrame? Commented Nov 15, 2016 at 1:57
  • Nope, I'll try explode. Thanks :) Commented Nov 15, 2016 at 2:01
  • Since key1 and key2 are not in a single column, I don't think explode is the right answer. Commented Nov 15, 2016 at 2:13
  • You can convert key1 and key2 into a tuple by applying a map function. Commented Nov 15, 2016 at 2:16
  • Could you give me an example of this? Commented Nov 15, 2016 at 4:40
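To illustrate the tuple idea from the comments in plain Scala (no Spark involved; the rows and the KEYA/KEYB labels are taken from the question's data), flatMap can turn each wide row into two long rows:

```scala
object Reshape {
  def main(args: Array[String]): Unit = {
    // Rows of test.csv as in-memory tuples: (name, key1, key2)
    val rows = Seq(("A", 1, 2), ("B", 1, 3), ("C", 4, 3))

    // Each wide row (name, key1, key2) becomes two long rows
    // (name, key, newkeyname) -- the shape of whatIwant.csv.
    val long = rows.flatMap { case (name, k1, k2) =>
      Seq((name, k1, "KEYA"), (name, k2, "KEYB"))
    }

    long.foreach(println)
  }
}
```

The same flatMap logic carries over to a Spark Dataset of case classes, where it runs per-partition instead of on a local Seq.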

1 Answer


This can be accomplished with explode and a udf:

scala> var df = Seq(("A", 1, 2), ("B", 1, 3), ("C", 4, 3)).toDF("name", "key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, key1: int ... 1 more field]

scala> df.show
+----+----+----+
|name|key1|key2|
+----+----+----+
|   A|   1|   2|
|   B|   1|   3|
|   C|   4|   3|
+----+----+----+

scala> val explodeUDF = udf((v1: String, v2: String) => Vector((v1, "Key1"), (v2, "Key2")))
explodeUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StructType(StructField(_1,StringType,true), StructField(_2,StringType,true)),true),Some(List(StringType, StringType)))

scala> df = df.withColumn("TMP", explode(explodeUDF($"key1", $"key2"))).drop("key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string>]

scala> df = df.withColumn("key", $"TMP".apply("_1")).withColumn("new key name", $"TMP".apply("_2"))
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string> ... 2 more fields]

scala> df = df.drop("TMP")
df: org.apache.spark.sql.DataFrame = [name: string, key: string ... 1 more field]

scala> df.show
+----+---+------------+
|name|key|new key name|
+----+---+------------+
|   A|  1|        Key1|
|   A|  2|        Key2|
|   B|  1|        Key1|
|   B|  3|        Key2|
|   C|  4|        Key1|
|   C|  3|        Key2|
+----+---+------------+
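The same reshape can also be sketched without a udf, using Spark's built-in array, struct, and explode functions (this assumes a spark-shell session where spark.implicits._ is in scope, and uses the KEYA/KEYB labels from the question's desired output):

```scala
import org.apache.spark.sql.functions.{array, explode, lit, struct}

// Pack (key1, KEYA) and (key2, KEYB) into an array of structs,
// then explode it so each input row yields two output rows.
val result = df
  .select($"name", explode(array(
    struct($"key1".as("key"), lit("KEYA").as("newkeyname")),
    struct($"key2".as("key"), lit("KEYB").as("newkeyname"))
  )).as("kv"))
  .select($"name", $"kv.key", $"kv.newkeyname")
```

Keeping the transformation in built-in functions lets Catalyst optimize it, whereas a udf is a black box to the optimizer.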

1 Comment

Profit! It's a bit different from my original problem, but I can make it work with this. Thanks a lot :)
