My Spark application reads a CSV file, transforms it into a different format with SQL, and writes the resulting DataFrame to another CSV file.
For example, I have an input CSV as follows:
Id|FirstName|LastName|LocationId
1|John|Doe|123
2|Alex|Doe|234
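For context, I read the input roughly like this (assuming an existing SparkSession named spark; the path variable is a placeholder):

// Read the pipe-delimited input with a header row; all columns arrive as strings
val input = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .csv(inputPath)

// Register the DataFrame so the SQL below can query it as "Input"
input.createOrReplaceTempView("Input")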
My transformation is:
SELECT Id,
       FirstName,
       LastName,
       LocationId AS PrimaryLocationId,
       null AS SecondaryLocationId
FROM Input
(I can't say why null is used for SecondaryLocationId; it's a business requirement.) Spark can't infer a data type for SecondaryLocationId, so the schema shows null for that column, and writing the output CSV fails with the error "CSV data source does not support null data type".
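The query runs through spark.sql (variable names are mine):

// Execute the transformation; the untyped null column comes back as NullType
val dataFrame = spark.sql(transformationSql)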
Below are the printSchema() output and the write options I am using.
root
|-- Id: string (nullable = true)
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
|-- PrimaryLocationId: string (nullable = false)
|-- SecondaryLocationId: null (nullable = true)
dataFrame.repartition(1).write
.mode(SaveMode.Overwrite)
.option("header", "true")
.option("delimiter", "|")
.option("nullValue", "")
.option("inferSchema", "true")
.csv(outputPath)
Is there a way to default the column to a data type (such as string)? I can get this to work by replacing null with an empty string (''), but that is not what I want to do.
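For what it's worth, I suspect an explicit cast would pin the type, along these lines (untested sketch):

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Replace the NullType column with a typed (string) null column before writing
val typed = dataFrame.withColumn("SecondaryLocationId", lit(null).cast(StringType))

The same could presumably be done in the SQL itself with CAST(null AS STRING) AS SecondaryLocationId, but I'd still like to know whether there is a way to default unknown types.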