
I am using PySpark in Jupyter on Azure. I am trying to test a UDF on a DataFrame; however, the UDF is not executing.

My dataframe is created by:

users = sqlContext.sql("SELECT DISTINCT userid FROM FoodDiaryData")

I have confirmed that this DataFrame is populated with 100 rows. In the next cell I try to execute a simple UDF.

def iterateMeals(user):
    print user

users.foreach(iterateMeals)

This produces no output. I would have expected each entry in the DataFrame to be printed. However, if I call iterateMeals('test') directly, it executes and prints 'test'. I also tried using pyspark.sql.functions:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def iterateMeals(user):
    print user
f_iterateMeals = udf(iterateMeals, LongType())

users.foreach(f_iterateMeals)

When I try this, I receive the following error:

Py4JError: An error occurred while calling o461.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist

Can someone explain where I have gone wrong? I will need to execute UDFs inside the .foreach of DataFrames for this application.


1 Answer

  1. You won't see any output because print is executed on the worker nodes, so its output goes to the workers' stdout, not to the driver. See Why does foreach not bring anything to the driver program? for a complete explanation.

  2. foreach operates on the underlying RDD, not on the DataFrame, so a UDF is not valid in this context; UDFs are meant to be applied inside DataFrame expressions such as select or withColumn (see the sketch after this list).
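
For reference, a minimal sketch of both fixes, assuming the same sqlContext and FoodDiaryData table from the question; label_user and the "label" column are made-up examples, not part of the original code:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

users = sqlContext.sql("SELECT DISTINCT userid FROM FoodDiaryData")

# To inspect rows on the driver, collect them first (fine for 100 rows,
# not for large datasets) and print locally.
for row in users.collect():
    print row.userid

# A UDF is applied inside a DataFrame expression (select/withColumn),
# not passed to foreach. label_user is a hypothetical example.
def label_user(user):
    return "user_" + str(user)

f_label_user = udf(label_user, StringType())
labeled = users.withColumn("label", f_label_user(users["userid"]))
labeled.show()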

