
A function in Scala receives as input a DataFrame that has a column named vin. The column has values in the following formats:

1. UJ123QR8467
2. 0UJ123QR846
3. /UJ123QR8467
4. -UJ123QR8467

and so on.

The requirement is to clean the vin column based on the following rules:

1. replace the characters /, _ and - with ""
2. replace a leading 0 with ""
3. if the value is more than 10 characters long, make the value NULL.

Is there a simplified way to achieve this? All I can think of is chaining multiple .withColumn calls with a regexp_replace for each rule, roughly like the sketch below.
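For context, this is the repeated-.withColumn approach I'd like to avoid (a minimal sketch; it assumes df is the input DataFrame and that spark.implicits._ is in scope for the $ syntax):

import org.apache.spark.sql.functions._

// One withColumn per cleaning rule:
val cleaned = df.
  withColumn("vin", regexp_replace($"vin", "[/_-]", "")).  // rule 1: strip /, _, -
  withColumn("vin", regexp_replace($"vin", "^0", "")).     // rule 2: drop a leading 0
  withColumn("vin", when(length($"vin") <= 10, $"vin"))    // rule 3: NULL if longer than 10 chars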


1 Answer


I would combine all regex-related changes into a single transformation and apply the length condition in another, as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._  // for the $ syntax and toDF (pre-imported in spark-shell)

val df = Seq(
  "UJ123QR8467", "00UJ123QR84", "/UJ123QR8467",
  "-UJ123QR8467", "UJ0123QR84", "UJ123-QR_846"
).toDF("vin")

df.
  withColumn("vin2", regexp_replace($"vin", "^[0]|[/_-]", "")).  // strip a leading 0 and every /, _, -
  withColumn("vin2", when(length($"vin2") <= 10, $"vin2")).      // NULL out anything longer than 10 chars
  show
// +------------+----------+
// |         vin|      vin2|
// +------------+----------+
// | UJ123QR8467|      null|
// | 00UJ123QR84|0UJ123QR84|
// |/UJ123QR8467|      null|
// |-UJ123QR8467|      null|
// |  UJ0123QR84|UJ0123QR84|
// |UJ123-QR_846|UJ123QR846|
// +------------+----------+

Note that I've slightly expanded the sample dataset to cover cases such as a non-leading 0, a doubled leading 0, and embedded [/_-] characters.
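Since Spark Column expressions are ordinary Scala values, you could even factor the regex expression out and collapse this into a single withColumn (a variation equivalent in result to the two-step version above):

val cleaned = regexp_replace($"vin", "^[0]|[/_-]", "")

// when(...) without an otherwise(...) yields NULL when the condition is false
df.withColumn("vin2", when(length(cleaned) <= 10, cleaned)).show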
