I'm trying to read in flight data from the Department of Transportation. It is stored in a CSV, and I keep getting java.lang.NumberFormatException: null
I have tried setting nanValue to the empty string, as its default value is NaN, but this hasn't worked.
My current code is:
spark = SparkSession.builder \
    .master('local') \
    .appName('Flight Delay') \
    .getOrCreate()

schema = StructType([
    StructField('Year', IntegerType(), nullable=True),
    StructField('Month', IntegerType(), nullable=True),
    StructField('Day', IntegerType(), nullable=True),
    StructField('Dow', IntegerType(), nullable=True),
    StructField('CarrierId', StringType(), nullable=True),
    StructField('Carrier', StringType(), nullable=True),
    StructField('TailNum', StringType(), nullable=True),
    StructField('Origin', StringType(), nullable=True),
    StructField('Dest', StringType(), nullable=True),
    StructField('CRSDepTime', IntegerType(), nullable=True),
    StructField('DepTime', IntegerType(), nullable=True),
    StructField('DepDelay', DoubleType(), nullable=True),
    StructField('TaxiOut', DoubleType(), nullable=True),
    StructField('TaxiIn', DoubleType(), nullable=True),
    StructField('CRSArrTime', IntegerType(), nullable=True),
    StructField('ArrTime', IntegerType(), nullable=True),
    StructField('ArrDelay', DoubleType(), nullable=True),
    StructField('Cancelled', DoubleType(), nullable=True),
    StructField('CancellationCode', StringType(), nullable=True),
    StructField('Diverted', DoubleType(), nullable=True),
    StructField('CRSElapsedTime', DoubleType(), nullable=True),
    StructField('ActualElapsedTime', DoubleType(), nullable=True),
    StructField('AirTime', DoubleType(), nullable=True),
    StructField('Distance', DoubleType(), nullable=True),
    StructField('CarrierDelay', DoubleType(), nullable=True),
    StructField('WeatherDelay', DoubleType(), nullable=True),
    StructField('NASDelay', DoubleType(), nullable=True),
    StructField('SecurityDelay', DoubleType(), nullable=True),
    StructField('LateAircraftDelay', DoubleType(), nullable=True)
])

flts = spark.read \
    .format('com.databricks.spark.csv') \
    .csv('/home/william/Projects/flight-delay/data/201601.csv',
         schema=schema, nanValue='', header='true')
Here is the CSV I'm working with: http://pastebin.com/waahrgqB
The last row there is where it breaks and raises the java.lang.NumberFormatException: null
It seems that some numeric columns are empty strings, while others are just empty. Can someone please help me with this?
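To make the "empty string versus just empty" observation concrete, here is a minimal diagnostic sketch in plain Python (no Spark needed). It assumes that "empty strings" means fields written as a pair of double quotes and "just empty" means nothing between two commas, and that no field contains an embedded comma. The function name count_empty_kinds and the sample text are hypothetical, made up for illustration and not taken from the real file.

```python
import csv
import io
from collections import Counter


def count_empty_kinds(text):
    """Count bare-empty and quoted-empty fields per column in CSV text.

    Assumes the first row is a header and that no field contains a comma.
    """
    # QUOTE_NONE tells the reader not to interpret quote characters,
    # so a field written as "" stays visible as the two-character string '""'.
    reader = csv.reader(io.StringIO(text), quoting=csv.QUOTE_NONE)
    header = next(reader)
    bare, quoted = Counter(), Counter()
    for row in reader:
        for name, field in zip(header, row):
            if field == '':
                bare[name] += 1      # nothing between the commas
            elif field == '""':
                quoted[name] += 1    # an explicitly quoted empty string
    return dict(bare), dict(quoted)


# Hypothetical sample rows, only to show the two kinds of empty field.
sample = 'Year,DepDelay,TaxiOut\n2016,,""\n2016,5.0,""\n'
bare, quoted = count_empty_kinds(sample)
print(bare)    # columns with bare-empty fields
print(quoted)  # columns with quoted-empty fields
```

Running the same function on the contents of the real file (for example, the text read from 201601.csv) should show which numeric columns contain each kind of empty field, which may help narrow down the row that triggers the exception.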
The import statements are missing, I think.