I am new to Scala and am trying to build a custom schema from an array of elements, so that I can read files using that schema.
I read the arrays from a JSON file, applied the explode method, and created a DataFrame with one row for each element of the columns array.
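import org.apache.spark.sql.functions.explode
import sqlContext.implicits._ // needed for the $"..." column syntax
val otherPeople = sqlContext.read.option("multiline", "true").json(otherPeopleDataset)
val column_values = otherPeople.withColumn("columns", explode($"columns")).select("columns.*")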
column_values.printSchema()
The output obtained is:
column_values: org.apache.spark.sql.DataFrame = [column_id: string, data_sensitivty: string ... 3 more fields]
root
|-- column_id: string (nullable = true)
|-- data_sensitivty: string (nullable = true)
|-- datatype: string (nullable = true)
|-- length: string (nullable = true)
|-- name: string (nullable = true)
val column_values = ddb_schema.withColumn("columns", explode($"columns")).select("columns.*")
val column_name = column_values.select("name", "datatype", "length")
column_name.show(6)
+------------------+--------+------+
|              name|datatype|length|
+------------------+--------+------+
|     object_number| varchar|   100|
|     function_type| varchar|   100|
|             hof_1| decimal|  17,3|
|             hof_2| decimal|  17,2|
|            region| varchar|   100|
|           country| varchar|  null|
+------------------+--------+------+
Now, for all the values listed above, I am trying to create the schema dynamically using the code below:
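import org.apache.spark.sql.types._

// Map a source datatype name to a Spark SQL type (defined before it is used below).
def getFieldType(typeName: String): DataType = typeName match {
  case "varchar" => StringType
  // TODO include other types here
  case _ => StringType
}

// Collect the (name, datatype, length) rows to the driver and fold them into a StructType.
val schemaColumns = column_name.collect()
val schema = schemaColumns.foldLeft(new StructType())(
  (schema, columnRow) => schema.add(columnRow.getAs[String]("name"), getFieldType(columnRow.getAs[String]("datatype")), true)
)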
The problem with the above is that I get the data types in the struct, but for the decimal datatype I would also like to get the precision and scale from the length column. There are two restrictions: if the length for a decimal is null or not present, the default (10,0) should be used, and if the precision given is greater than 38 (the maximum allowed), the default (38,0) should be used.
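One possible way to express these rules is sketched below; it is not a definitive implementation. It assumes the length column holds either a single number or a "precision,scale" string such as "17,3". The helper name `decimalTypeFor` and the extra `length` parameter on `getFieldType` are hypothetical. Treating unparsable values like missing ones and capping the scale at the precision are additional assumptions, not requirements stated above. The sample rows stand in for the collected `column_name` rows.

```scala
import org.apache.spark.sql.types._
import scala.util.Try

// Hypothetical helper: turns a length string such as "17,3" into a DecimalType.
def decimalTypeFor(length: String): DecimalType = {
  // Option(null) is None, so a null length produces no parts at all.
  val parts: List[String] =
    Option(length).map(_.split(",").map(_.trim).toList).getOrElse(Nil)
  val precision = parts.headOption.flatMap(p => Try(p.toInt).toOption)
  // Assumption: a missing or unparsable scale counts as 0.
  val scale = parts.drop(1).headOption.flatMap(s => Try(s.toInt).toOption).getOrElse(0)

  precision match {
    case Some(p) if p > 38 => DecimalType(38, 0) // above the maximum precision
    case Some(p) if p >= 1 =>
      // Assumption: keep the scale within 0..precision so the type is valid.
      DecimalType(p, math.max(0, math.min(scale, p)))
    case _ => DecimalType(10, 0) // null, empty, or unparsable length
  }
}

// Variant of getFieldType that also receives the length column.
def getFieldType(typeName: String, length: String): DataType = typeName match {
  case "decimal" => decimalTypeFor(length)
  case "varchar" => StringType
  case _         => StringType
}

// Sample (name, datatype, length) values taken from the table above.
val sampleColumns = Seq(
  ("object_number", "varchar", "100"),
  ("hof_1", "decimal", "17,3"),
  ("country", "varchar", null)
)

val sampleSchema = sampleColumns.foldLeft(new StructType()) {
  case (acc, (name, datatype, length)) =>
    acc.add(name, getFieldType(datatype, length), true)
}
```

With the collected rows, the same fold would pass `columnRow.getAs[String]("length")` as the second argument; that call returns null for a null cell, which the helper maps to (10,0). The resulting StructType can then be handed to `sqlContext.read.schema(...)` when reading the files.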