0
origin.csv
no,key1,key2,key3,key4,key5,...
1,A1,B1,C1,D1,E1,..
2,A2,B2,C2,D2,E2,..
3,A3,B3,C3,D3,E3,..


WhatIwant.csv
1,A1,key1
1,B1,key2
1,C1,key3
...
3,A3,key1
3,B3,key2
...

I loaded the CSV with the read method (into the origin.csv DataFrame), but I am unable to convert it to the desired shape.

// Read the source CSV into a DataFrame; every column is inferred as string
// since no schema is supplied.
val df = spark.read
            .option("header", true)      // first line of the file holds the column names
            .option("charset", "euc-kr") // file is Korean EUC-KR encoded, not UTF-8
            .csv(csvFilePath)

Any idea how to do this?

1 Answer 1

1

Try this.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sample frame mirroring origin.csv from the question.
// Fixed: row 3's last value was a typo ("D2") — the question's data shows "D3".
val df = Seq(
  (1, "A1", "B1", "C1", "D1"),
  (2, "A2", "B2", "C2", "D2"),
  (3, "A3", "B3", "C3", "D3")
).toDF("no", "key1", "key2", "key3", "key4")
df.show()

/**
 * Unpivots (melts) `df`: every column NOT listed in `by` becomes one output
 * row of (value, key), while the `by` columns are kept as row identifiers.
 * Note: despite the name, this is a plain DataFrame transformation, not a
 * registered Spark UDF.
 *
 * @param df the input frame
 * @param by identifier columns to keep as-is (e.g. Seq("no"))
 * @return   a frame with columns: by..., val, key
 */
def myUDF(df: DataFrame, by: Seq[String]): DataFrame = {
  // Columns to melt, with their types. All melted values end up in a single
  // array, so they must share one type — fail early with a useful message.
  val (columns, types) = df.dtypes.filter{ case (clm, _) => !by.contains(clm)}.unzip
  require(types.distinct.size == 1,
    s"All melted columns must share one type, found: ${types.distinct.mkString(", ")}")
  // One struct per melted column: (key = column name, val = column value),
  // exploded so each input row yields one output row per melted column.
  val keys = explode(array(
    columns.map(clm => struct(lit(clm).alias("key"), col(clm).alias("val"))): _*
  ))
  val byValue = by.map(col)
  // col("...") instead of the $"..." interpolator: $ requires
  // spark.implicits._ (or sqlContext.implicits._) in scope, which is exactly
  // the "value $ is not a member of StringContext" error reported below.
  df.select(byValue :+ keys.alias("_key"): _*)
    .select(byValue ++ Seq(col("_key.val"), col("_key.key")): _*)
}

// Melt every column except "no" into (value, key) rows and print the result.
val df1 = myUDF(df, by = Seq("no"))
df1.show()
Sign up to request clarification or add additional context in comments.

4 Comments

I got an error from `Seq($"_key.val"`. The error message is "value $ is not a member of StringContext".
I tested this with spark-shell on Spark version 2.0.0. What version are you using?
I'm using org.scala-lang:scala-library:2.11.1.
Try adding import for "import sqlContext.implicits._".

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.