
I have a local CSV file, "test.csv", whose first row contains the column names and whose remaining rows contain data. I tried reading the CSV like this in Java:

Dataset<Row> test_table = sparkSession()
    .sqlContext()
    .read()
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("test.csv");

This was suggested here:
Read csv as Data Frame in spark 1.6

But I keep getting the error:

java.lang.NegativeArraySizeException
    at com.univocity.parsers.common.input.DefaultCharAppender.<init>(DefaultCharAppender.java:39)
    at com.univocity.parsers.csv.CsvParserSettings.newCharAppender(CsvParserSettings.java:82)
    at com.univocity.parsers.common.ParserOutput.<init>(ParserOutput.java:93)
    at com.univocity.parsers.common.AbstractParser.<init>(AbstractParser.java:74)
    at com.univocity.parsers.csv.CsvParser.<init>(CsvParser.java:59)
    at org.apache.spark.sql.execution.datasources.csv.CsvReader.<init>(CSVParser.scala:49)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:61)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
    at scala.Option.orElse(Option.scala:289)
    at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)

What's the problem, and how can I read the CSV into a Dataset?

2 Answers

Author of the univocity-parsers library here. This is happening because Spark internally sets the maximum value length to -1, meaning no limit. Support for that setting was only introduced in univocity-parsers 2.2.0 and later.

Just make sure the version of this library on your classpath is 2.2.0 or later and you should be fine, as older versions don't support setting the maxCharsPerColumn property to -1.

If you have multiple versions of that library in your classpath, get rid of the older ones. Ideally you'd want to update to the latest version (currently 2.5.4) and use only that. It should work just fine, as we make sure any changes made to the library are backward compatible.
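
If you're not sure which version of the parser Spark is actually loading, a minimal diagnostic sketch (the class name below is just a placeholder) is to print where the CsvParser class was loaded from; the jar's file name usually contains the version:

import com.univocity.parsers.csv.CsvParser;

public class CheckUnivocityVersion {
    public static void main(String[] args) {
        // Ask the JVM which jar the CsvParser class actually came from.
        java.security.CodeSource source =
            CsvParser.class.getProtectionDomain().getCodeSource();
        // CodeSource can be null for bootstrap classes; for a classpath jar it
        // points at something like .../univocity-parsers-2.5.4.jar
        System.out.println(source != null ? source.getLocation() : "unknown");
    }
}

If this prints a path to an older jar, that is the dependency to exclude or upgrade.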


It is most likely due to the dependencies you are using. Try a different version of the spark-csv package, for example:

   --packages com.databricks:spark-csv_2.10:1.5.0

or

   --packages com.databricks:spark-csv_2.10:1.4.0

It should work.
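
For context, --packages is a spark-submit (and spark-shell) flag that downloads the named package and its transitive dependencies and puts them on the driver and executor classpaths. A sketch of a full invocation, where the application jar and main class are placeholders:

# The jar name and main class below are placeholders; only the
# --packages coordinate matters for pulling in spark-csv.
spark-submit \
    --class com.example.CsvReadJob \
    --packages com.databricks:spark-csv_2.10:1.5.0 \
    my-app.jar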
