25

Similar question as here, but don't have enough points to comment there.

According to the latest Spark documentation an udf can be used in two different ways, one with SQL and another with a DataFrame. I found multiple examples of how to use an udf with sql, but have not been able to find any on how to use a udf directly on a DataFrame.

The solution provided by the o.p. on the question linked above uses __callUDF()__ which is _deprecated_ and will be removed in Spark 2.0 according to the Spark Java API documentation. There, it says:

"since it's redundant with udf()"

so this means I should be able to use __udf()__ to cal a my udf, but I can't figure out how to do that. I have not stumbled on anything that spells out the syntax for Java-Spark programs. What am I missing?

import org.apache.spark.sql.api.java.UDF1;
.
.    
UDF1 mode = new UDF1<String[], String>() {
    public String call(final String[] types) throws Exception {
        return types[0];
    }
};

sqlContext.udf().register("mode", mode, DataTypes.StringType);
df.???????? how do I call my udf (mode) on a given column of my DataFrame df?
5
  • It is not. Check carefully signatures :) Some example code? UDF + data? Some formatting? Commented Feb 11, 2016 at 20:57
  • Added code to clarify what I'm asking. As for the complaining part, I have a nagging feeling that I'm not doing it right. It should not take hours to figure out how to do things in Java-Spark. I think I'm missing something, some book(s), some documentation somewhere, some source of knowledge that will make the clues I get from my IDE sufficient to do things without having to google for hours. Everything I find is Scala and it's not clear at all to me how to do the same things in Java. Commented Feb 11, 2016 at 21:17
  • Well, technically speaking Scala classes are valid Java classes. It means these can be used directly in Java. Problem is that Scala is much richer language than Java. It means that many things cannot be easily done without unwrapping all the Scala magic. Commented Feb 11, 2016 at 22:05
  • so you're telling me that I do need to move to Scala.. it does seem that this would be a better investment of time than keep trying to shoehorn Spark into Java code. Thank you. Commented Feb 12, 2016 at 13:22
  • Not necessarily but it can be easier for you than dealing with Scala internals. Commented Feb 12, 2016 at 13:30

1 Answer 1

36

Spark >= 2.3

Scala-style udf can be invoked directly:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.UserDefinedFunction;

UserDefinedFunction mode = udf(
  (Seq<String> ss) -> ss.headOption(), DataTypes.StringType
);

df.select(mode.apply(col("vs"))).show();

Spark < 2.3

Even if we assume that your UDF is useful and cannot be replaced by a simple getItem call it has incorrect signature. Array columns are exposed using Scala WrappedArray not plain Java Arrays so you have to adjust the signature:

UDF1 mode = new UDF1<Seq<String>, String>() {
  public String call(final Seq<String> types) throws Exception {
    return types.headOption();
  }
};

If UDF is already registered:

sqlContext.udf().register("mode", mode, DataTypes.StringType);

you can simply use callUDF (which is a new function introduced in 1.5) to call it by name:

df.select(callUDF("mode", col("vs"))).show();

You can also use it in selectExprs:

df.selectExpr("mode(vs)").show();
Sign up to request clarification or add additional context in comments.

3 Comments

First, thank you. The udf is simplified a little, the one I would end up writing will return a single String that is a function of the String array in the column (row by row, no aggregation). This seemed like a perfect case for one of those functions in the spark.sql.functions suite, but what I need (most frequent item in string array) is not there, hence me trying to develop my own udf().
I successfully used this solution to split the Spark ML "probability" vector column into multiple columns using Java code. Hopefully, this will help someone find the solution faster than I did.
In Spark >= 2.3, how do you pass multiple columns to the UDF defined in the answer?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.