Spark .csv variable number of columns

Question

I have a case class like this:

case class ResultDays (name: String, number: Double, values: Double*)

and I want to save it into a .csv file

resultRDD.toDF()
  .coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("res/output/result.csv")

Unfortunately, I get this error:

java.lang.UnsupportedOperationException: CSV data source does not support array<double> data type.

So, how can I write a variable number of values and save them to a .csv file?

CSV, as a format, does not support a variable number of values, in the sense that all records must have the same columns. Do you know anything about the number of values expected? Perhaps the maximum number of values the values member might have?
– Tzach Zohar, Commented Feb 13, 2017 at 10:28
I have to write the same number of values for every row, but I don't know how many there are before the run.
– Francesco Gusmeroli, Commented Feb 13, 2017 at 10:32
OK, but once you have resultRDD, can you assume all records have the same number of values?
– Tzach Zohar, Commented Feb 13, 2017 at 11:00
Yes, but the ResultDays class contains Double*, and it seems I cannot use that.
– Francesco Gusmeroli, Commented Feb 13, 2017 at 11:03

Tzach Zohar · Accepted Answer · 2017-02-13 11:22:24Z


If you can assume that all records in resultRDD have the same number of elements in values, you can read the first() record, use it to determine the size of the arrays, and convert the array elements into separate columns:

// determine number of "extra" columns:
val extraCols = resultRDD.first().values.size

// create a sequence of desired columns:
val columns = Seq($"name", $"number") ++ (1 to extraCols).map(i => $"values"(i - 1) as s"col$i")

// select the above columns before saving:
resultRDD.toDF()
  .select(columns: _*)
  .coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("res/output/result.csv")

Example CSV result would be something like:

name,number,col1,col2
a,0.1,0.01,0.001
b,0.2,0.02,0.002
c,0.3,0.03,0.003
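
For reference, the same idea can be sketched as a self-contained program. This is only a sketch under several assumptions that are not in the original post: Spark 2.x or later with the built-in CSV writer (instead of com.databricks.spark.csv), a hypothetical ResultRow case class that uses Seq[Double] in place of Double*, a hypothetical helper named flatten, and a non-empty input in which every row has the same number of values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, size}

// Hypothetical stand-in for ResultDays; Seq[Double] replaces Double*.
case class ResultRow(name: String, number: Double, values: Seq[Double])

object FlattenValues {
  // Expands the array column "values" into col1..colN.
  // Assumes the DataFrame is non-empty and all arrays have the same length.
  def flatten(df: DataFrame): DataFrame = {
    // Length of the array in the first row decides the number of extra columns.
    val extraCols = df.select(size(col("values"))).first().getInt(0)
    val columns = Seq(col("name"), col("number")) ++
      (0 until extraCols).map(i => col("values").getItem(i).as(s"col${i + 1}"))
    df.select(columns: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flatten-values")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ResultRow("a", 0.1, Seq(0.01, 0.001)),
      ResultRow("b", 0.2, Seq(0.02, 0.002)),
      ResultRow("c", 0.3, Seq(0.03, 0.003))
    ).toDF()

    flatten(df)
      .coalesce(1)
      .write
      .option("header", "true")
      .csv("res/output/result.csv") // Spark creates a directory with this name

    spark.stop()
  }
}
```

Note that Spark writes a directory at the given path containing a part file, rather than a single file named result.csv.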

edited Feb 13, 2017 at 11:22

answered Feb 13, 2017 at 11:09
