Imagine the following input:
val data = Seq("1::Alice", "2::Bob")
val dfInput = data.toDF("input")
val dfTwoColTypeString = dfInput
  .map(row => row.getString(0).split("::"))
  .map { case Array(id, name) => (id, name) }
  .toDF("id", "name")
Now I have a DataFrame with the desired columns:
scala> dfTwoColTypeString.show
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+
Of course, I would like the column id to be of type int, but it is of type string:
scala> dfTwoColTypeString.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
Therefore I define this schema:
import org.apache.spark.sql.types._

val mySchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true)
))
What is the best way to cast or convert the DataFrame dfTwoColTypeString to the given target schema?
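To make the goal concrete, here is a minimal sketch of a naive per-column cast driven by the target schema. The helper name castToSchema is hypothetical (not a Spark API), and the block builds its own local SparkSession and sample data only so that it runs standalone; in spark-shell the existing spark, dfTwoColTypeString and mySchema could be used instead.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Local session only so the sketch runs standalone (assumption);
// in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("cast-sketch")
  .getOrCreate()
import spark.implicits._

// Same shape as dfTwoColTypeString: two string columns.
val df = Seq(("1", "Alice"), ("2", "Bob")).toDF("id", "name")

val schema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true)
))

// Hypothetical helper: select every field of the target schema
// and cast it to that field's data type.
def castToSchema(df: DataFrame, schema: StructType): DataFrame =
  df.select(schema.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)

val dfCast = castToSchema(df, schema)
dfCast.printSchema()
```

As far as I understand, with ANSI mode disabled a value that cannot be cast silently becomes null here, so the original input is lost, and with ANSI mode enabled the cast fails with an error instead.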
Bonus: If the given input cannot be cast or converted to the target schema, I would love to get a null row with an extra column "bad_record" containing the bad input data. That is, I want to accomplish the same thing as the CSV parser in PERMISSIVE mode.
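A rough sketch of the behavior I have in mind is below. It is not a built-in Spark feature: castPermissive is a hypothetical helper, the "::" separator used to rebuild the bad input is an assumption taken from my sample data, and the approach relies on ANSI mode being disabled so that a failed cast yields null instead of an error. The block creates its own local session and data so it runs standalone.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, when}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// ANSI mode is switched off explicitly so that invalid casts return null
// instead of throwing (assumption this sketch depends on).
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("permissive-cast-sketch")
  .config("spark.sql.ansi.enabled", "false")
  .getOrCreate()
import spark.implicits._

// Second row has an id that cannot be cast to int.
val df = Seq(("1", "Alice"), ("x", "Bob")).toDF("id", "name")

val schema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true)
))

// Hypothetical helper mimicking PERMISSIVE mode for an already-split DataFrame.
def castPermissive(df: DataFrame, schema: StructType, sep: String = "::"): DataFrame = {
  // A row is bad if some non-null input value turns into null after the cast.
  val isBad: Column = schema.fields
    .map(f => col(f.name).isNotNull && col(f.name).cast(f.dataType).isNull)
    .reduce(_ || _)

  // Good rows keep the cast values; bad rows get null in every target column.
  val typed = schema.fields.map { f =>
    when(!isBad, col(f.name).cast(f.dataType)).as(f.name)
  }

  // Bad rows carry the re-joined original input; good rows get null.
  val badRecord =
    when(isBad, concat_ws(sep, schema.fields.map(f => col(f.name)): _*)).as("bad_record")

  df.select((typed :+ badRecord): _*)
}

val result = castPermissive(df, schema)
result.show()
```

This evaluates every cast twice and only approximates the original input line, so I would still prefer a cleaner or built-in way.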
Any help is really appreciated.