I want to normalize author names by removing the accents.

Input:  orčpžsíáýd
Output: orcpzsiayd

The code below should allow me to achieve this. However, I am not sure how to do it using Spark functions when my input is a DataFrame column.

def stringNormalizer(c : Column) = (
    import org.apache.commons.lang.StringUtils
    return StringUtils.stripAccents(c.toString)
)

This is how I would like to call it:

val normalizedAuthor = flat_author.withColumn("NormalizedAuthor",
  stringNormalizer(df_article("authors")))

I have just started learning Spark, so please let me know if there is a better way to achieve this without UDFs.

2 Answers

It requires a UDF:

import org.apache.commons.lang.StringUtils
import org.apache.spark.sql.functions.{col, udf}

val stringNormalizer = udf((s: String) => StringUtils.stripAccents(s))

df_article.select(stringNormalizer(col("authors")))
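
To make it clearer what happens to each string inside such a UDF, here is a minimal sketch that strips accents with `java.text.Normalizer` from the JDK. This is a swapped-in technique for illustration, not the `StringUtils.stripAccents` implementation itself, and the object name `AccentStripper` is hypothetical. It decomposes each accented character into a base letter plus combining marks, then removes the marks.

```scala
import java.text.Normalizer

// Hypothetical helper, not part of Spark or commons-lang.
object AccentStripper {
  // NFD splits e.g. "č" into "c" followed by a combining caron;
  // the regex \p{M} then removes all combining marks.
  def stripAccents(s: String): String =
    if (s == null) null // keep nulls as nulls, as a DataFrame column may contain them
    else Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "")
}

object AccentStripperDemo extends App {
  println(AccentStripper.stripAccents("orčpžsíáýd")) // prints "orcpzsiayd"
}
```

Assuming Spark is on the classpath, this function could be wrapped the same way as above, for example `udf((s: String) => AccentStripper.stripAccents(s))`. Letters that have no canonical decomposition (such as "ł" or "ø") are left unchanged by this approach.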

Although it doesn't look as pretty, I found that removing accents like this, without a UDF, took half the time. Note that it also upper-cases the column:

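import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, regexp_replace, upper}

def withColumnFormated(columnName: String)(df: DataFrame): DataFrame = {
  // Upper-case first so that only upper-case accented letters need a mapping.
  val dfWithColumnUpper = df.withColumn(columnName, upper(col(columnName)))
  // Regex character class -> unaccented replacement.
  val accents: Map[String, String] = Map(
    "[ÃÁÀÂÄ]" -> "A", "[ÉÈÊË]" -> "E", "[ÍÌÎÏ]" -> "I",
    "[Ñ]" -> "N", "[ÓÒÔÕÖ]" -> "O", "[ÚÙÛÜ]" -> "U",
    "[Ç]" -> "C")

  // Apply one regexp_replace per map entry, threading the DataFrame through the fold.
  accents.foldLeft(dfWithColumnUpper) { (tempDf, replaceElement) =>
    tempDf.withColumn(columnName,
      regexp_replace(col(columnName), lit(replaceElement._1), lit(replaceElement._2)))
  }
}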

And then you can apply it like this:

df_article.transform(withColumnFormated("authors"))
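
Another UDF-free option that might be worth trying is the built-in `translate` function from `org.apache.spark.sql.functions`, which replaces characters one-for-one in a single expression. The sketch below is only an illustration under stated assumptions: the object name `AccentTranslate` and the character lists are hypothetical and deliberately incomplete (lower-case only), so they would need extending for real data.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.translate

// Hypothetical helper; the mapping below is an assumption, not an exhaustive list.
object AccentTranslate {
  // The i-th character of `accented` is replaced by the i-th character of `plain`.
  val accented = "áàâäãčçďéèêëíìîïñóòôöõřšťúùûüýž"
  val plain    = "aaaaaccdeeeeiiiinooooorstuuuuyz"
  require(accented.length == plain.length, "mapping strings must align one-to-one")

  // Returns a Column expression, so it can be passed straight to withColumn.
  def stripAccents(c: Column): Column = translate(c, accented, plain)
}
```

It could then be called in the shape the question asks for, for example `flat_author.withColumn("NormalizedAuthor", AccentTranslate.stripAccents(col("authors")))`. Unlike the regex version above, this keeps the original letter case, but any accented character missing from the mapping passes through untouched.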
