
I am using PySpark in Jupyter on Azure. I am trying to test a UDF on a DataFrame; however, the UDF is not executing.

My dataframe is created by:

users = sqlContext.sql("SELECT DISTINCT userid FROM FoodDiaryData")

I have confirmed that this DataFrame is populated with 100 rows. In the next cell I try to execute a simple UDF.

def iterateMeals(user):
    print user

users.foreach(iterateMeals)

This produces no output. I would have expected each entry in the DataFrame to be printed. However, if I call iterateMeals('test') directly, it executes and prints 'test'. I also tried using pyspark.sql.functions:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def iterateMeals(user):
    print user
f_iterateMeals = udf(iterateMeals, LongType())

users.foreach(f_iterateMeals)

When I try this, I receive the following error:

Py4JError: An error occurred while calling o461.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist

Can someone explain where I have gone wrong? I will need to execute UDFs inside the .foreach of DataFrames for this application.


1 Answer

  1. You won't see any output because print is executed on the worker nodes, so its output goes to the workers' stdout, not to the driver. See Why does foreach not bring anything to the driver program? for a complete explanation.

  2. foreach operates on the underlying RDD, not on the DataFrame, so a UDF is not valid in this context; UDFs are meant to be applied inside DataFrame expressions such as select or withColumn (see the sketch after this list).
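
For reference, a minimal sketch of both fixes, assuming the same sqlContext and FoodDiaryData table from the question; label_user and the "label" column are made-up examples, not part of the original code:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

users = sqlContext.sql("SELECT DISTINCT userid FROM FoodDiaryData")

# To inspect rows on the driver, collect them first (fine for 100 rows,
# not for large datasets) and print locally.
for row in users.collect():
    print row.userid

# A UDF is applied inside a DataFrame expression (select/withColumn),
# not passed to foreach. label_user is a hypothetical example.
def label_user(user):
    return "user_" + str(user)

f_label_user = udf(label_user, StringType())
labeled = users.withColumn("label", f_label_user(users["userid"]))
labeled.show()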

