
A function in Scala receives as input a DataFrame that has a column named vin. The column has values in the following formats:

1. UJ123QR8467
2. 0UJ123QR846
3. /UJ123QR8467
4. -UJ123QR8467

and so on.

The requirement is to clean the vin column based on the following rules:

1. replace the characters /, _ and - with ""
2. replace a leading 0 with ""
3. if the value is more than 10 characters long, make the value NULL.

Is there a simplified way to achieve this? All I can think of is chaining multiple .withColumn calls with a regexp_replace for each rule, roughly like the sketch below.
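For context, this is the repeated-.withColumn approach I'd like to avoid (a minimal sketch; it assumes df is the input DataFrame and that spark.implicits._ is in scope for the $ syntax):

import org.apache.spark.sql.functions._

// One withColumn per cleaning rule:
val cleaned = df.
  withColumn("vin", regexp_replace($"vin", "[/_-]", "")).  // rule 1: strip /, _, -
  withColumn("vin", regexp_replace($"vin", "^0", "")).     // rule 2: drop a leading 0
  withColumn("vin", when(length($"vin") <= 10, $"vin"))    // rule 3: NULL if longer than 10 chars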


1 Answer


I would combine all regex-related changes into a single transformation and apply the length condition in another, as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._  // for the $ syntax and toDF (pre-imported in spark-shell)

val df = Seq(
  "UJ123QR8467", "00UJ123QR84", "/UJ123QR8467",
  "-UJ123QR8467", "UJ0123QR84", "UJ123-QR_846"
).toDF("vin")

df.
  withColumn("vin2", regexp_replace($"vin", "^[0]|[/_-]", "")).  // strip a leading 0 and every /, _, -
  withColumn("vin2", when(length($"vin2") <= 10, $"vin2")).      // NULL out anything longer than 10 chars
  show
// +------------+----------+
// |         vin|      vin2|
// +------------+----------+
// | UJ123QR8467|      null|
// | 00UJ123QR84|0UJ123QR84|
// |/UJ123QR8467|      null|
// |-UJ123QR8467|      null|
// |  UJ0123QR84|UJ0123QR84|
// |UJ123-QR_846|UJ123QR846|
// +------------+----------+

Note that I've slightly expanded the sample dataset to cover cases such as a non-leading 0, a doubled leading 0, and embedded [/_-] characters.
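Since Spark Column expressions are ordinary Scala values, you could even factor the regex expression out and collapse this into a single withColumn (a variation equivalent in result to the two-step version above):

val cleaned = regexp_replace($"vin", "^[0]|[/_-]", "")

// when(...) without an otherwise(...) yields NULL when the condition is false
df.withColumn("vin2", when(length(cleaned) <= 10, cleaned)).show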
