Writing Custom Spark functions on Spark columns/ Dataframe

Question

I want to normalize author names by removing the accents.

Input:  orčpžsíáýd
Output: orcpzsiayd

The code below allows me to achieve this for a plain string. However, I am not sure how I can do this using Spark functions when my input is a DataFrame column.

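import org.apache.commons.lang.StringUtils
import org.apache.spark.sql.Column

// Note: c.toString yields the column expression's text, not the values in each row
def stringNormalizer(c: Column): String = StringUtils.stripAccents(c.toString)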

This is how I would like to be able to call it:

val normalizedAuthor = flat_author.withColumn("NormalizedAuthor",
  stringNormalizer(df_article("authors")))

I have just started learning Spark, so please let me know if there is a better way to achieve this without UDFs.

user6022341 · Accepted Answer · 2016-03-16 21:53:27Z

1

It requires a UDF:

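import org.apache.commons.lang.StringUtils
import org.apache.spark.sql.functions.{col, udf}

// Wrap the plain String => String function so Spark can apply it row by row
val stringNormalizer = udf((s: String) => StringUtils.stripAccents(s))

df_article.select(stringNormalizer(col("authors")))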
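
The question asked for a `withColumn` call that adds a `NormalizedAuthor` column, and the same UDF can be used that way. The following is a minimal sketch rather than a definitive setup: the local `SparkSession` and the one-row `df_article` are hypothetical stand-ins for illustration, and it assumes `org.apache.commons.lang` is on the classpath as in the question (on some installations the class may instead live in `org.apache.commons.lang3`).

```scala
import org.apache.commons.lang.StringUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session created purely for this illustration
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("strip-accents-udf-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the real df_article
val df_article = Seq("orčpžsíáýd").toDF("authors")

// Same UDF as in the answer above
val stringNormalizer = udf((s: String) => StringUtils.stripAccents(s))

// Adds a new column next to the original instead of replacing it
val normalizedAuthor =
  df_article.withColumn("NormalizedAuthor", stringNormalizer(col("authors")))

normalizedAuthor.show()
```

Because a UDF is opaque to Spark's optimizer, it is usually worth checking whether built-in column functions can do the job before settling on this form.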

answered Mar 16, 2016 at 21:53

community wiki

user6022341


user19700827 · Accepted Answer · 2023-01-11 22:58:45Z

Although it doesn't look as pretty, I found that it took half the amount of time to remove accents like this without a UDF:

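import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace, upper}

def withColumnFormated(columnName: String)(df: DataFrame): DataFrame = {
  // Upper-case first, so only upper-case accented characters need to be mapped
  val dfWithColumnUpper = df.withColumn(columnName, upper(col(columnName)))
  val accents: Map[String, String] = Map(
    "[ÃÁÀÂÄ]" -> "A", "[ÉÈÊË]" -> "E", "[ÍÌÎÏ]" -> "I",
    "[Ñ]" -> "N", "[ÓÒÔÕÖ]" -> "O", "[ÚÙÛÜ]" -> "U",
    "[Ç]" -> "C")

  // Apply one regexp_replace per character class, overwriting the column each time
  accents.foldLeft(dfWithColumnUpper) { case (tempDf, (pattern, replacement)) =>
    tempDf.withColumn(columnName, regexp_replace(col(columnName), pattern, replacement))
  }
}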

And then you can apply it like this:

df_article.transform(withColumnFormated("authors"))
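
Note that the mapping above upper-cases the column and only covers the characters it lists, so letters such as č, ž and ý from the question's example would pass through unchanged. One possible variation, sketched here as an assumption and not as part of the original answer, is the built-in `translate` function, which maps characters one-to-one in a single pass and keeps the original case. The `accentGroups` table, the `stripAccentsBuiltin` helper, the local session and the sample data are all hypothetical names introduced for illustration; the table would need extending to match the characters in your data.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, translate}

// Assumption: a local session created purely for this illustration
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("strip-accents-translate-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical mapping table: accented characters -> plain replacement
val accentGroups: Seq[(String, String)] = Seq(
  "ÁÀÂÄÃ" -> "A", "áàâäã" -> "a",
  "ČÇ" -> "C", "čç" -> "c",
  "ÉÈÊË" -> "E", "éèêë" -> "e",
  "ÍÌÎÏ" -> "I", "íìîï" -> "i",
  "Ñ" -> "N", "ñ" -> "n",
  "ÓÒÔÖÕ" -> "O", "óòôöõ" -> "o",
  "Š" -> "S", "š" -> "s",
  "ÚÙÛÜ" -> "U", "úùûü" -> "u",
  "Ý" -> "Y", "ý" -> "y",
  "Ž" -> "Z", "ž" -> "z")

// translate expects two strings of equal length, matched position by position
val from = accentGroups.map(_._1).mkString
val to = accentGroups.map { case (chars, plain) => plain * chars.length }.mkString

// Hypothetical helper: a Column => Column function, so no UDF is involved
def stripAccentsBuiltin(c: Column): Column = translate(c, from, to)

// Hypothetical sample data standing in for the real df_article
val df_article = Seq("orčpžsíáýd").toDF("authors")
val normalized =
  df_article.withColumn("NormalizedAuthor", stripAccentsBuiltin(col("authors")))

normalized.show()
```

Unlike `StringUtils.stripAccents`, this only handles characters that appear in the table, so it trades completeness for staying inside Spark's built-in functions.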
