
I have a data frame with columns: user, address1, address2, address3, phone1, phone2, and so on. I want to convert this data frame to: user, address, phone, where address = Map("address1" -> address1.value, "address2" -> address2.value, "address3" -> address3.value).

I was able to convert the columns to a map using:

val mapData = List("address1", "address2", "address3")
df.map(_.getValuesMap[Any](mapData)) // yields the maps alone, detached from the DataFrame

but I am not sure how to add this back to my DataFrame.

I am new to Spark and Scala and could really use some help here.

1 Answer


Spark >= 2.0

You can skip the udf and use the map SQL function (create_map in Python):

import org.apache.spark.sql.functions.{col, lit, map}

df.select(
  // interleave literal keys with their value columns: key1, value1, key2, value2, ...
  map(mapData.map(c => lit(c) :: col(c) :: Nil).flatten: _*).alias("a_map")
)
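
If you want to keep the remaining columns alongside the map, withColumn works the same way (a minimal sketch reusing mapData and the user column from the question; dfWithAddress is an illustrative name):

import org.apache.spark.sql.functions.{col, lit, map}

// attach the map as a new column instead of projecting it out alone
val dfWithAddress = df.withColumn(
  "address",
  map(mapData.flatMap(c => Seq(lit(c), col(c))): _*)
)

// individual values can be looked up by key
dfWithAddress.select(col("user"), col("address")("address1")).show()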

Spark < 2.0

As far as I know there is no direct way to do it. You can use a UDF like this:

import org.apache.spark.sql.functions.{udf, array, lit, col}

// sample data with the layout from the question
val df = sc.parallelize(Seq(
  (1L, "addr1", "addr2", "addr3")
)).toDF("user", "address1", "address2", "address3")

// build a map from parallel key/value sequences, dropping entries with null values
val asMap = udf((keys: Seq[String], values: Seq[String]) =>
  keys.zip(values).filter {
    case (_, null) => false
    case _ => true
  }.toMap)

val keys = array(mapData.map(lit): _*)   // literal column names used as map keys
val values = array(mapData.map(col): _*) // the matching value columns

val dfWithMap = df.withColumn("address", asMap(keys, values))
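
To see the null filtering in action, a row with a missing address simply loses that key (a small sketch reusing the asMap, keys, and values defined above; dfNulls is an illustrative name):

val dfNulls = sc.parallelize(Seq(
  (2L, "addr1", null: String, "addr3")
)).toDF("user", "address1", "address2", "address3")

// address2 is null, so the resulting map keeps only address1 and address3
dfNulls.withColumn("address", asMap(keys, values)).show(false)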

Another option, which doesn't require UDFs, is to use a struct field instead of a map:

import org.apache.spark.sql.functions.struct

val dfWithStruct = df.withColumn("address", struct(mapData.map(col): _*))

The biggest advantage is that a struct can easily handle values of different types, since each field keeps its own type.
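
For instance, struct fields show up with their individual types in the schema and can be addressed with dot syntax (a brief sketch against the dfWithStruct defined above):

// each field keeps its original column type in the schema
dfWithStruct.printSchema()

// struct fields are referenced like nested columns
dfWithStruct.select(col("user"), col("address.address1")).show()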
