Fillna PySpark Dataframe with numpy array Error

Question

The following is a sample of my Spark DataFrame with the printSchema below it:

+--------------------+---+------+------+--------------------+
|           device_id|age|gender| group|                apps|
+--------------------+---+------+------+--------------------+
|-9073325454084204615| 24|     M|M23-26|                null|
|-8965335561582270637| 28|     F|F27-28|[1.0,1.0,1.0,1.0,...|
|-8958861370644389191| 21|     M|  M22-|[4.0,0.0,0.0,0.0,...|
|-8956021912595401048| 21|     M|  M22-|                null|
|-8910497777165914301| 25|     F|F24-26|                null|
+--------------------+---+------+------+--------------------+
only showing top 5 rows

root
 |-- device_id: long (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- group: string (nullable = true)
 |-- apps: vector (nullable = true)

I'm trying to fill the nulls in the 'apps' column with np.zeros(19237). However, when I execute

df.fillna({'apps': np.zeros(19237)})

I get an error

Py4JJavaError: An error occurred while calling o562.fill.
: java.lang.IllegalArgumentException: Unsupported value type java.util.ArrayList

Or if I try

df.fillna({'apps': DenseVector(np.zeros(19237))})

I get an error

AttributeError: 'numpy.ndarray' object has no attribute '_get_object_id'

Any ideas?

zero323 · Accepted Answer · 2017-06-06 17:51:51Z

DataFrameNaFunctions supports only a subset of native types (no UDTs), so you'll need a UDF here.

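from pyspark.sql.functions import coalesce, col, udf
from pyspark.ml.linalg import Vectors, VectorUDT

def zeros(n):
    """Return a Column that evaluates to an all-zero sparse vector of size n."""
    def zeros_():
        # A sparse vector with no stored entries is all zeros.
        return Vectors.sparse(n, {})
    # Wrap in a zero-argument UDF and call it to get a Column expression.
    return udf(zeros_, VectorUDT())()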

Example usage:

df = spark.createDataFrame(
    [(1, Vectors.dense([1, 2, 3])), (2, None)],
    ("device_id", "apps"))

df.withColumn("apps", coalesce(col("apps"), zeros(3))).show()

+---------+-------------+
|device_id|         apps|
+---------+-------------+
|        1|[1.0,2.0,3.0]|
|        2|    (3,[],[])|
+---------+-------------+

edited Jun 6, 2017 at 17:51

answered Jun 6, 2017 at 17:46
