I have a local CSV file, test.csv, whose first row contains the column names and whose remaining rows contain data. I tried reading it into a Dataset in Java like this:
Dataset<Row> test_table = sparkSession()
        .sqlContext()
        .read()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("test.csv");
This approach was suggested in Read csv as Data Frame in spark 1.6, but I keep getting the following error:
java.lang.NegativeArraySizeException
at com.univocity.parsers.common.input.DefaultCharAppender.<init>(DefaultCharAppender.java:39)
at com.univocity.parsers.csv.CsvParserSettings.newCharAppender(CsvParserSettings.java:82)
at com.univocity.parsers.common.ParserOutput.<init>(ParserOutput.java:93)
at com.univocity.parsers.common.AbstractParser.<init>(AbstractParser.java:74)
at com.univocity.parsers.csv.CsvParser.<init>(CsvParser.java:59)
at org.apache.spark.sql.execution.datasources.csv.CsvReader.<init>(CSVParser.scala:49)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:61)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
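From the trace, the failure happens inside Spark's own CSV code path (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat and the bundled univocity parser) rather than inside the Databricks package itself, so I wonder whether the built-in Spark 2.x csv source is the intended entry point now. This is a sketch of what I believe that call would look like (sparkSession() is my own accessor returning the SparkSession; I have not verified that this avoids the exception):

Dataset<Row> test_table = sparkSession()
        .read()
        .format("csv")                 // built-in CSV source in Spark 2.x
        .option("header", "true")      // first row holds the column names
        .option("inferSchema", "true") // sample the file to guess column types
        .load("test.csv");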
What is causing this exception, and how can I read the CSV into a Dataset&lt;Row&gt;?
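For reference, here is a minimal, self-contained version of what I am running (a sketch: the class name, app name, and local[*] master are illustrative assumptions; the failing read itself is unchanged):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvReadRepro {
    public static void main(String[] args) {
        // Local-mode session; the builder settings are illustrative only.
        SparkSession spark = SparkSession.builder()
                .appName("csv-read-repro")
                .master("local[*]")
                .getOrCreate();

        // The same read that fails above.
        Dataset<Row> test_table = spark.sqlContext()
                .read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .option("inferSchema", "true")
                .load("test.csv");

        test_table.printSchema(); // never reached; load() throws during schema inference
        test_table.show();

        spark.stop();
    }
}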