Hello everyone!
I have a DataFrame with 2,510,765 rows containing application reviews and their associated scores, with the following structure:

root
 |-- content: string (nullable = true)
 |-- score: string (nullable = true)

I wrote these two functions to remove punctuation and emojis from the text:

import string

def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

and

import re

def removeEmoji(text):
    regex_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags=re.UNICODE)
    return regex_pattern.sub(r'', text)
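For reference, both functions work fine on an ordinary Python string, so the problem is not in the functions themselves (the sample string below is just an illustration):

```python
import re
import string

def remove_punct(text):
    # Delete every ASCII punctuation character from the string.
    return text.translate(str.maketrans('', '', string.punctuation))

def removeEmoji(text):
    # Strip characters in the common emoji / pictograph Unicode ranges.
    regex_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags=re.UNICODE)
    return regex_pattern.sub(r'', text)

sample = "Great app!!! 😀 Would recommend."
print(remove_punct(sample))   # Great app 😀 Would recommend
print(removeEmoji(sample))    # Great app!!!  Would recommend.
```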

I use the udf function to create Spark UDFs from the functions I defined for removing punctuation and emojis:

from pyspark.sql.functions import udf

punct_remove = udf(lambda s: remove_punct(s))

removeEmoji = udf(lambda s: removeEmoji(s))

But I get the following error:

TypeError                                 Traceback (most recent call last)

<ipython-input-29-e5d42d609b59> in <module>()
----> 1 new_df = new_df.withColumn("content", remove_punct(df_merge["content"]))
      2 new_df.show(5)

<ipython-input-21-dee888ef5b90> in remove_punct(text)
      2 
      3 def remove_punct(text):
----> 4     return text.translate(str.maketrans('', '', string.punctuation))
      5 
      6 

TypeError: 'Column' object is not callable

How can this be solved? Is there another way to run user-written functions on the DataFrame?
Thank you ;)


1 Answer


The stack trace suggests that you are calling the plain Python function directly, not the udf.

remove_punct is a plain Python function, while punct_remove is a udf that can be used as the second argument of the withColumn call.
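To see where the message 'Column' object is not callable comes from, here is a minimal Spark-free reproduction. The Column class below is a hypothetical stand-in that mimics one relevant behavior of pyspark's Column: accessing an unknown attribute yields another Column expression instead of raising AttributeError, so calling it fails:

```python
import string

class Column:
    # Hypothetical stand-in for pyspark's Column: attribute access
    # produces another Column expression rather than raising.
    def __getattr__(self, name):
        return Column()

def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

col = Column()
try:
    # col.translate resolves to a Column, and a Column is not callable,
    # just like when the plain function receives df["content"].
    remove_punct(col)
except TypeError as e:
    print(e)  # 'Column' object is not callable
```

This is exactly what happens when remove_punct receives df["content"]: the Column has no real translate method, so calling it raises the TypeError shown in the traceback.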

One way to solve the problem is to use punct_remove instead of remove_punct in the withColumn call.

Another way to reduce the chance of mixing up the Python function with the udf is to use the @udf decorator:

import string

from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(returnType=T.StringType())
def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

df.withColumn("content", remove_punct(F.col("content"))).show()