I'm getting the following error trying to build an ML Pipeline:
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,true).'
My features column contains an array of floating point values. It sounds like I need to convert those to some type of vector (it's not sparse, so a DenseVector?). Is there a way to do this directly on the DataFrame or do I need to convert to an RDD?
Use `array_to_vector` from `pyspark.ml.functions` to convert the array column to a vector type. Only available in pyspark >= 3.1.0. More details here: stackoverflow.com/a/48333361/2650427.