
I have a local CSV file, "test.csv", whose first row contains the column names and whose remaining rows contain data. I tried reading the CSV like this in Java:

Dataset<Row> test_table = sparkSession()
    .sqlContext()
    .read()
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("test.csv");

This was suggested here:
Read csv as Data Frame in spark 1.6

But I keep getting the error:

java.lang.NegativeArraySizeException
    at com.univocity.parsers.common.input.DefaultCharAppender.<init>(DefaultCharAppender.java:39)
    at com.univocity.parsers.csv.CsvParserSettings.newCharAppender(CsvParserSettings.java:82)
    at com.univocity.parsers.common.ParserOutput.<init>(ParserOutput.java:93)
    at com.univocity.parsers.common.AbstractParser.<init>(AbstractParser.java:74)
    at com.univocity.parsers.csv.CsvParser.<init>(CsvParser.java:59)
    at org.apache.spark.sql.execution.datasources.csv.CsvReader.<init>(CSVParser.scala:49)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:61)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
    at scala.Option.orElse(Option.scala:289)
    at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)

What's the problem, and how can I read the CSV into a Dataset?

2 Answers

Author of the univocity-parsers library here. This is happening because Spark internally sets the maximum value length to -1, meaning no limit. Support for that setting was only introduced in univocity-parsers 2.2.0 and later.

Just make sure the version of this library on your classpath is 2.2.0 or later and you should be fine, as older versions don't support setting the maxCharsPerColumn property to -1.

If you have multiple versions of that library in your classpath, get rid of the older ones. Ideally you'd want to update to the latest version (currently 2.5.4) and use only that. It should work just fine, as we make sure any changes made to the library are backward compatible.
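
If you're not sure which version of the parser Spark is actually loading, a minimal diagnostic sketch (the class name below is just a placeholder) is to print where the CsvParser class was loaded from; the jar's file name usually contains the version:

import com.univocity.parsers.csv.CsvParser;

public class CheckUnivocityVersion {
    public static void main(String[] args) {
        // Ask the JVM which jar the CsvParser class actually came from.
        java.security.CodeSource source =
            CsvParser.class.getProtectionDomain().getCodeSource();
        // CodeSource can be null for bootstrap classes; for a classpath jar it
        // points at something like .../univocity-parsers-2.5.4.jar
        System.out.println(source != null ? source.getLocation() : "unknown");
    }
}

If this prints a path to an older jar, that is the dependency to exclude or upgrade.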


It is most likely due to the dependencies you are using. Try a different version of the spark-csv package, for example:

   --packages com.databricks:spark-csv_2.10:1.5.0

or

   --packages com.databricks:spark-csv_2.10:1.4.0

It should work.
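
For context, --packages is a spark-submit (and spark-shell) flag that downloads the named package and its transitive dependencies and puts them on the driver and executor classpaths. A sketch of a full invocation, where the application jar and main class are placeholders:

# The jar name and main class below are placeholders; only the
# --packages coordinate matters for pulling in spark-csv.
spark-submit \
    --class com.example.CsvReadJob \
    --packages com.databricks:spark-csv_2.10:1.5.0 \
    my-app.jar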
