If I understand correctly, ArrayType can be added as Spark DataFrame columns. I am trying to add a multidimensional array to an existing Spark DataFrame by using the withColumn method. My idea is to have this array available with each DataFrame row in order to use it to send back information from the map function.

The error I get says that the withColumn function is looking for a Column type but it is getting an array. Are there any other functions that will allow adding an ArrayType?

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    object TestDataFrameWithMultiDimArray {
      val nrRows = 1400
      val nrCols = 500

      /** Our main function where the action happens */
      def main(args: Array[String]): Unit = {

        // Create a SparkContext using every core of the local machine
        val sc = new SparkContext("local[*]", "TestDataFrameWithMultiDimArray")
        val sqlContext = new SQLContext(sc)

        val PropertiesDF = sqlContext.read
          .format("com.crealytics.spark.excel")
          .option("location", "C:/Users/tjoha/Desktop/Properties.xlsx")
          .option("useHeader", "true")
          .option("treatEmptyValuesAsNulls", "true")
          .option("inferSchema", "true")
          .option("addColorColumns", "False")
          .option("sheetName", "Sheet1")
          .load()

        PropertiesDF.show()
        PropertiesDF.printSchema()

        // This is the line that fails: withColumn expects a Column, not an Array
        val PropertiesDFPlusMultiDimArray = PropertiesDF.withColumn("ArrayCol", Array.ofDim[Any](nrRows,nrCols))

      }
    }

Thanks for your help.

Kind regards,

Johann

1 Answer

There are two problems in your code:

  1. The second argument to withColumn needs to be a Column. You can wrap a constant value with the function lit (from org.apache.spark.sql.functions).
  2. Spark can't take Any as a column type; you need to use a specific supported type, such as Int.

    val PropertiesDFPlusMultiDimArray = PropertiesDF.withColumn("ArrayCol", lit(Array.ofDim[Int](nrRows,nrCols)))

will do the trick
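For anyone who wants to try the idea in isolation, here is a minimal sketch. It assumes Spark 2.2 or later, where typedLit is available alongside lit and takes the column type from the Scala type of the value. The object name, the zeroGrid helper, and the small in-memory DataFrame standing in for PropertiesDF are all made up for illustration, and the grid is kept tiny on purpose.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.typedLit

object TypedLitArrayColumnSketch {

  /** Hypothetical helper: a rows x cols grid of zeros as nested Seqs. */
  def zeroGrid(rows: Int, cols: Int): Seq[Seq[Int]] =
    Seq.fill(rows, cols)(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("TypedLitArrayColumnSketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for PropertiesDF, which the question loads from Excel.
    val propertiesDF = Seq(("house", 3), ("flat", 1)).toDF("kind", "rooms")

    // typedLit wraps the constant in a Column; the element type is Int, not Any.
    val withArray = propertiesDF.withColumn("ArrayCol", typedLit(zeroGrid(2, 3)))

    withArray.printSchema()
    withArray.show(truncate = false)

    spark.stop()
  }
}
```

The same constant is attached to every row, so this suits a fixed starting value rather than a per-row result.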

4 Comments

Hi Will, thanks for your answer. What does lit() do? My end goal is to calculate multiple rows and columns of values for each line of data in the DataFrame and to return them as an array. There will be multiple different types of data in the array, including strings, integers and floating point numbers. Do you have any idea how to achieve this type of functionality? Also, does the array you set up above work as a multidimensional array? Once I have answers I will create a new post, as the answer led to new questions.
I see lit() creates a column of literal value.
@TJVR arrays in a statically typed language always have the same type for all elements. Maybe you want to either create a new StructType for this column, or add each array as its own column instead?
wow how did you dig up that ability to specify a type within the literal function?
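
The StructType suggestion from the comments could look roughly like the sketch below. It assumes the mixed values can be described by a fixed record, and that the per-row result is computed by a UDF returning a sequence of such records, which Spark maps to an array of structs. The Cell case class, the cellsFor function, and the sample DataFrame are hypothetical names chosen for the example, not anything from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical record type: one element of the per-row result, with mixed field types.
final case class Cell(label: String, count: Int, score: Double)

object StructArrayColumnSketch {

  /** Hypothetical per-row computation: returns one Cell per room. */
  def cellsFor(rooms: Int): Seq[Cell] =
    (1 to rooms).map(i => Cell(s"room-$i", i, i * 0.5))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("StructArrayColumnSketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for PropertiesDF.
    val propertiesDF = Seq(("house", 3), ("flat", 1)).toDF("kind", "rooms")

    // The return type Seq[Cell] becomes an array of structs with fields label, count, score.
    val cellsUdf = udf(cellsFor _)

    val withCells = propertiesDF.withColumn("Cells", cellsUdf(col("rooms")))

    withCells.printSchema()
    withCells.show(truncate = false)

    spark.stop()
  }
}
```

Each struct field keeps its own type, which avoids needing Any as an element type.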
