Using Spark with Java, I have created a DataFrame from a comma-delimited source file. If the last column of a row in the source file is blank, the job throws an ArrayIndexOutOfBoundsException. Sample data and code are below. Is there any way I can handle this error? There is a high chance of blank values appearing in the last column. In the sample data below, the 4th row causes the issue.
Sample Data:
1,viv,chn,34
2,man,gnt,56
3,anu,pun,22
4,raj,bang,*
Code:
JavaRDD<String> dataQualityRDD = spark.sparkContext().textFile(inputFile, 1).toJavaRDD();

// Build a schema of nullable StringType columns from the space-separated column names
String schemaString = schemaColumns;
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
    StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
    fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);

JavaRDD<Row> rowRDD = dataQualityRDD.map((Function<String, Row>) record -> {
    // When the last column is blank, this split produces fewer items than the schema expects
    Object[] items = record.split(fileSplit);
    return RowFactory.create(items);
});
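For reference, the usual cause of this is that `String.split(regex)` drops trailing empty strings, so a row with a blank last column yields fewer fields than the schema has columns. A minimal standalone sketch of one workaround, passing a negative limit to `split` and, as an extra safety net, padding short rows out to the schema width (the `fieldCount` value here is an assumption standing in for your schema size):

```java
import java.util.Arrays;

public class BlankLastColumn {
    public static void main(String[] args) {
        int fieldCount = 4;               // assumed schema width (id, name, city, age)
        String badRow = "4,raj,bang,";    // blank last column

        // Default split drops trailing empty strings -> only 3 fields
        String[] dropped = badRow.split(",");
        System.out.println(dropped.length); // 3

        // A limit of -1 keeps trailing empty strings -> 4 fields
        String[] kept = badRow.split(",", -1);
        System.out.println(kept.length); // 4

        // Safety net: pad any short row to the schema width with nulls
        Object[] items = Arrays.copyOf(kept, fieldCount, Object[].class);
        System.out.println(Arrays.toString(items));
    }
}
```

In the mapper above, that would mean using `record.split(fileSplit, -1)` instead of `record.split(fileSplit)`, so blank trailing columns become empty strings rather than missing fields.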

