
I am trying to cast all the fields of a dataframe (with nested StructTypes) to String.

I have already seen some solutions on Stack Overflow (like how to cast all columns of dataframe to string), but they only work on flat dataframes without nested structs.

I'll explain what I really need via an example:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._
    import spark.implicits._
    val rows1 = Seq(
      Row(1, Row("a", "b"), 8.00, Row(1, 2)),
      Row(2, Row("c", "d"), 9.00, Row(3, 4))
    )

    val rows1Rdd = spark.sparkContext.parallelize(rows1, 4)

    val schema1 = StructType(
      Seq(
        StructField("id", IntegerType, true),
        StructField("s1", StructType(
          Seq(
            StructField("x", StringType, true),
            StructField("y", StringType, true)
          )
        ), true),
        StructField("d", DoubleType, true),
        StructField("s2", StructType(
          Seq(
            StructField("u", IntegerType, true),
            StructField("v", IntegerType, true)
          )
        ), true)
      )
    )

    val df1 = spark.createDataFrame(rows1Rdd, schema1)

    println("Schema with nested struct")
    df1.printSchema()

If we print the schema of the created dataframe, we get the following result:

root
|-- id: integer (nullable = true)
|-- s1: struct (nullable = true)
|    |-- x: string (nullable = true)
|    |-- y: string (nullable = true)
|-- d: double (nullable = true)
|-- s2: struct (nullable = true)
|    |-- u: integer (nullable = true)
|    |-- v: integer (nullable = true)

I tried to cast all the values to String as follows:

    df1.select(df1.columns.map(c => col(c).cast(StringType)) : _*)

But it converts the nested StructTypes to their string representation instead of casting each value inside them to String:

root
|-- id: string (nullable = true)
|-- s1: string (nullable = true)
|-- d: string (nullable = true)
|-- s2: string (nullable = true)

Is there a simple solution that will help me cast all the values to StringType? Here's the schema that I want my dataframe to have after the cast:

root
|-- id: string (nullable = true)
|-- s1: struct (nullable = true)
|    |-- x: string (nullable = true)
|    |-- y: string (nullable = true)
|-- d: string (nullable = true)
|-- s2: struct (nullable = true)
|    |-- u: string (nullable = true)
|    |-- v: string (nullable = true)

Thanks a lot !

  • What do you expect casting a struct to string to produce? Can you show an example dataframe and its output after the conversion? Commented Jul 25, 2018 at 10:27
  • The output of the dataframe should be the same; it's just the types that will change to StringType. All the types (integers, doubles, etc.) should be cast to String. Commented Jul 25, 2018 at 12:11

3 Answers


After some days of investigation, I found the best solution to my question:

val newSchema = StructType(
  Seq(
    StructField("id", StringType, true),
    StructField("s1", StructType(
      Seq(
        StructField("x", StringType, true),
        StructField("y", StringType, true)
      )
    ), true),
    StructField("d", StringType, true),
    StructField("s2", StructType(
      Seq(
        StructField("u", StringType, true),
        StructField("v", StringType, true)
      )
    ), true)
  )
)

val expressions = newSchema.map(
  field => s"CAST(${field.name} AS ${field.dataType.sql}) AS ${field.name}"
)

val result = df1.selectExpr(expressions : _*)
result.show()
+---+------+---+------+
| id|    s1|  d|    s2|
+---+------+---+------+
|  1|[a, b]|8.0|[1, 2]|
|  2|[c, d]|9.0|[3, 4]|
+---+------+---+------+

I hope this helps someone. I spent a lot of time looking for this solution (I needed it since I was working with large dataframes with many columns that needed to be cast).
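For readers who cannot write the target schema by hand, the same idea can be made generic by deriving the all-string schema from the existing one. The sketch below is my own addition, not part of the original answer; `toStringSchema` is a hypothetical helper name, and it only handles struct nesting (no arrays or maps):

```scala
import org.apache.spark.sql.types._

// Recursively replace every non-struct leaf type with StringType,
// preserving the nested struct shape of the schema.
def toStringSchema(dt: DataType): DataType = dt match {
  case st: StructType =>
    StructType(st.fields.map(f => f.copy(dataType = toStringSchema(f.dataType))))
  case _ => StringType
}

// Usage with the dataframe from the question:
//   val newSchema   = toStringSchema(df1.schema).asInstanceOf[StructType]
//   val expressions = newSchema.map(f => s"CAST(${f.name} AS ${f.dataType.sql}) AS ${f.name}")
//   val result      = df1.selectExpr(expressions: _*)
```

This removes the need to redefine `newSchema` manually for every dataframe, since the CAST expressions are generated from whatever schema the input happens to have.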


4 Comments

  • This doesn't do anything. It will return the same dataframe.
  • I edited the solution; I was using df1.schema instead of schema1 to create the expressions. Now it works like a charm.
  • As I said, it doesn't do anything. You are not casting to string; you are just casting each column to its original type.
  • My bad, I forgot to provide a schema. I edited the answer.

You can create SQL expressions for the simple columns and the struct columns separately.

The solution is not very generic, but it should work as long as the only complex columns you have are struct types. The code can handle a variable number of fields under each struct, not just two.

val structCastExpression = df1.schema
  .filter(_.dataType.isInstanceOf[StructType])
  .map(c => (c.name, c.dataType.asInstanceOf[StructType].map(_.name)))
  .map { case (col, sub) =>
    s"""cast($col as struct${sub.map(c => s"$c:string").mkString("<", ",", ">")}) as $col"""
  }
//List(cast(s1 as struct<x:string,y:string> ) as s1,
//     cast(s2 as struct<u:string,v:string> ) as s2)

val otherColumns = df1.schema
  .filterNot(_.dataType.isInstanceOf[StructType])
  .map(c => s"cast(${c.name} as string) as ${c.name}")
//List(cast(id as string) as id, cast(d as string) as d)

//original columns
val originalColumns = df1.columns

// Union both the expressions into one big expression
val finalExpression = otherColumns.union(structCastExpression)
// List(" cast(id as string) as id ", 
//      " cast(d as string) as d ", 
//      cast(s1 as struct<x:string,y:string> ) as s1, 
//      cast(s2 as struct<u:string,v:string> ) as s2 )

// Use `selectExpr` to pass the expression 

df1.selectExpr(finalExpression : _*)
   .select(originalColumns.head, originalColumns.tail: _*)
   .printSchema

//root
// |-- id: string (nullable = true)
// |-- s1: struct (nullable = true)
// |    |-- x: string (nullable = true)
// |    |-- y: string (nullable = true)
// |-- d: string (nullable = true)
// |-- s2: struct (nullable = true)
// |    |-- u: string (nullable = true)
// |    |-- v: string (nullable = true)

5 Comments

  • This solution works like a charm! But is there a way to keep the order of the fields the same as before in the result dataframe?
  • Something like .select(df1.columns.head, df1.columns.tail: _*) after the selectExpr should work, provided df1 is the original DataFrame.
  • Yes, perfect! I validated your answer, thanks a lot for your help.
  • What if I want to keep arrays too? How can we adapt this solution?
  • Hello @philantrovert, I found a generic solution. Please check out my answer.
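On the array question in the comments: one possible adaptation (my own sketch, not from the original answer) is to generate the DDL type string recursively, so ArrayType keeps its array shape while its element type is rewritten to string. `stringDdl` is a hypothetical helper name:

```scala
import org.apache.spark.sql.types._

// Build a Spark SQL DDL type string with every leaf type replaced by
// string, preserving struct and array shapes along the way.
def stringDdl(dt: DataType): String = dt match {
  case st: StructType =>
    st.fields.map(f => s"${f.name}:${stringDdl(f.dataType)}").mkString("struct<", ",", ">")
  case at: ArrayType => s"array<${stringDdl(at.elementType)}>"
  case _             => "string"
}

// Usage, analogous to the answer above:
//   val expressions = df1.schema.map(f => s"cast(${f.name} as ${stringDdl(f.dataType)}) as ${f.name}")
//   df1.selectExpr(expressions: _*)
```

Because every column goes through the same expression builder, there is no longer a need to split struct and non-struct columns into separate lists.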

You can do it with a udf and a custom case class, as below:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class s2(u: String, v: String)

def changeToStr(row: Row): s2 =
  s2(row.get(0).toString, row.get(1).toString)

val changeToStrUDF = udf(changeToStr _)
val df2 = df1.select(
  df1.col("id").cast(StringType),
  df1.col("s1"),
  df1.col("d").cast(StringType),
  changeToStrUDF(df1.col("s2")).alias("s2")
)

3 Comments

This solution might work when you have 4 columns as in the example: you can then recreate the schema with all values as String. But that is not the case in a real-world scenario. I have a dataframe with many columns and multiple nested StructTypes, and the main idea is to change all the field types to String generically, without having to redefine a schema or case classes.
Yes, it will work when you know the schema. I will try another approach and let you know.
Thanks a lot for your help.
