Pyspark removing multiple characters in a dataframe column

Question

Looking at PySpark, I see translate and regexp_replace, which help me replace a single character that exists in a DataFrame column.

I was wondering if there is a way to supply multiple strings to regexp_replace or translate so that it would parse them and replace them with something else.

Use case: remove every $, #, and comma (,) in column A.

pault · Accepted Answer · 2018-06-08 19:04:14Z

39

You can use pyspark.sql.functions.translate() to make multiple single-character replacements. Pass in a string of characters to replace and another string of equal length that holds the replacement characters, matched up by position.

For example, let's say you had the following DataFrame:

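import pyspark.sql.functions as f
from pyspark.sql import SparkSession

# Get or create a SparkSession to build the example DataFrame
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("$100,00",),("#foobar",),("foo, bar, #, and $",)], ["A"])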
df.show()
#+------------------+
#|                 A|
#+------------------+
#|           $100,00|
#|           #foobar|
#|foo, bar, #, and $|
#+------------------+

and wanted to replace ('$', '#', ',') with ('X', 'Y', 'Z'). Simply use translate like:

df.select("A", f.translate(f.col("A"), "$#,", "XYZ").alias("replaced")).show()
#+------------------+------------------+
#|                 A|          replaced|
#+------------------+------------------+
#|           $100,00|           X100Z00|
#|           #foobar|           Yfoobar|
#|foo, bar, #, and $|fooZ barZ YZ and X|
#+------------------+------------------+
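
The positional pairing that translate relies on can be modelled in plain Python, without a Spark session. This is only an analogy built on Python's own str.maketrans and str.translate, not Spark's implementation, and the helper name translate_chars is made up for illustration.

```python
def translate_chars(text, matching, replace):
    """Map matching[i] to replace[i] for every character of text."""
    if len(matching) != len(replace):
        raise ValueError("matching and replace must have the same length")
    # str.maketrans pairs the two strings up position by position.
    return text.translate(str.maketrans(matching, replace))


# Same sample values and mapping as the DataFrame example above.
for value in ["$100,00", "#foobar", "foo, bar, #, and $"]:
    print(value, "->", translate_chars(value, "$#,", "XYZ"))
```

Each character is looked up on its own, which is why this approach suits single-character swaps rather than multi-character substrings.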

If instead you wanted to remove all instances of ('$', '#', ','), you could do this with pyspark.sql.functions.regexp_replace().

# Raw string, so Python passes \$ through to the regex engine unchanged
df.select("A", f.regexp_replace(f.col("A"), r"[\$#,]", "").alias("replaced")).show()
#+------------------+-------------+
#|                 A|     replaced|
#+------------------+-------------+
#|           $100,00|        10000|
#|           #foobar|       foobar|
#|foo, bar, #, and $|foo bar  and |
#+------------------+-------------+

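The pattern [\$#,] matches any one of the characters inside the brackets. Outside a character class, $ is an end-of-input anchor and has to be escaped to match literally; inside the brackets it is already treated as a literal character, so the backslash here is optional but harmless.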
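
If the set of characters to strip varies, the character class can be built programmatically. The sketch below is one way to do that: build_char_class and remove_chars are hypothetical helper names, and it assumes that the escapes produced by Python's re.escape are read the same way by the Java regex engine behind regexp_replace.

```python
import re


def build_char_class(chars):
    """Return a regex character class matching any single character in chars."""
    if not chars:
        raise ValueError("chars must contain at least one character")
    # re.escape backslash-escapes regex metacharacters such as $, ], ^ and -.
    # Assumption: these escapes mean the same thing to the Java regex engine
    # used by Spark's regexp_replace as they do to Python's re module.
    return "[" + "".join(re.escape(c) for c in chars) + "]"


def remove_chars(df, column, chars):
    """Overwrite column with a copy that has every character in chars removed."""
    # Imported here so build_char_class can be tried without Spark installed.
    import pyspark.sql.functions as f

    pattern = build_char_class(chars)
    return df.withColumn(column, f.regexp_replace(f.col(column), pattern, ""))


# Local check of the pattern with Python's re module, no Spark session needed.
print(re.sub(build_char_class("$#,"), "", "foo, bar, #, and $"))
```

With the DataFrame from the example above, remove_chars(df, "A", "$#,") should leave column A holding the same values as the replaced column.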

2 Comments

Sheldore Over a year ago

For removing all instances, you can also use translate. Switching to regexp_replace is not needed. I believe the following would do the job: df.select("A", f.translate(f.col("A"), "$#,", "").alias("replaced")).show()

MuscleUpUp Over a year ago

@Sheldore, your solution does not work properly. It replaces the characters with a space.

Nikunj Kakadiya · Accepted Answer · 2021-04-05 13:02:42Z

1

If someone needs to do this in Scala, the code below works:

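import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and the $"..." column syntax; assumes a SparkSession named spark

val df = Seq(("Test$",19),("$#,",23),("Y#a",20),("ZZZ,,",21)).toDF("Name","age")
// Map $ to x, # to y, and , to z in the Name column
val df1 = df.withColumn("NewName",translate($"Name","$#,","xyz"))
df1.show()  // display(df1) is the Databricks notebook equivalent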

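The NewName column then holds the translated values: each $ becomes x, each # becomes y, and each comma becomes z, so Test$ turns into Testx.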
