
I have a DataFrame with a single string column, and I would like to split each string using a pandas_udf in PySpark. I have the following code:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('str')
def split_msg(string):
  msg_ = string.split(" ")
  return msg_

temp = temp.select("_c6").withColumn("decoded", split_msg(temp._c6)).drop("_c6")

But this is not working.

Any help is much appreciated!

1 Answer


Change your function to the following:

@pandas_udf('array<string>', PandasUDFType.SCALAR) 
def split_msg(string): 
    msg_ = string.str.split(" ") 
    return msg_ 

Basically, your function's returnType should be an array of StringType(), and the argument string is a pandas Series, so you need string.str.split(" ").
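To see why the .str accessor is needed, here is a minimal pandas-only sketch of what a scalar pandas_udf receives: each batch arrives as a pandas Series, so a plain string.split(" ") would fail, while Series.str.split(" ") splits element-wise:

```python
import pandas as pd

# Inside a scalar pandas_udf, each batch of the column arrives as a
# pandas Series, not a single Python string.
batch = pd.Series(["hello world", "foo bar baz"])

# Element-wise split via the .str accessor -> a Series of lists,
# which maps to the 'array<string>' return type.
split_batch = batch.str.split(" ")
print(split_batch.tolist())  # [['hello', 'world'], ['foo', 'bar', 'baz']]
```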

However, if you just want to split the text, Spark's DataFrame API provides a built-in function, pyspark.sql.functions.split, which should be more efficient than a pandas_udf because it runs natively in the JVM and avoids serializing data to Python.
