
I have a DataFrame with a single string column, and I would like to split each string using a pandas_udf in PySpark. I have the following code:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('str')
def split_msg(string):
  msg_ = string.split(" ")
  return msg_

temp = temp.select("_c6").withColumn("decoded", split_msg(temp._c6)).drop("_c6")

But this is not working.

Any help is much appreciated!

1 Answer


Change your function to the following:

@pandas_udf('array<string>', PandasUDFType.SCALAR) 
def split_msg(string): 
    msg_ = string.str.split(" ") 
    return msg_ 

Basically, your function's returnType should be an array of StringType(), and the argument string is a pandas Series, so you need string.str.split(" ").
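To see why the .str accessor is needed, here is a minimal pandas-only sketch of what a scalar pandas_udf receives: each batch arrives as a pandas Series, so a plain string.split(" ") would fail, while Series.str.split(" ") splits element-wise:

```python
import pandas as pd

# Inside a scalar pandas_udf, each batch of the column arrives as a
# pandas Series, not a single Python string.
batch = pd.Series(["hello world", "foo bar baz"])

# Element-wise split via the .str accessor -> a Series of lists,
# which maps to the 'array<string>' return type.
split_batch = batch.str.split(" ")
print(split_batch.tolist())  # [['hello', 'world'], ['foo', 'bar', 'baz']]
```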

However, if you just want to split the text, Spark's DataFrame API provides a built-in function, pyspark.sql.functions.split, which should be more efficient than a pandas_udf because it runs natively in the JVM and avoids serializing data to Python.
