I am working with a PySpark DataFrame that has a column of words (array<string> type). What regex pattern should I use to drop entries that are purely numeric and strip the digits out of the remaining words?

+---+----------------------------------------------+
|id |words                                         |
+---+----------------------------------------------+
|564|[fhbgtrj5, 345gjhg, ghth578ghu, 5897, fhrfu44]|
+---+----------------------------------------------+

Expected output:

+---+----------------------------------------------+
|id |words                                         |
+---+----------------------------------------------+
|564|[fhbgtrj, gjhg, ghthghu, fhrfu]               |
+---+----------------------------------------------+

Please help.

  • Does this answer your question? Delete digits in Python (Regex) Commented Mar 25, 2021 at 22:27
  • @jbflow Thanks for looking into it. The references you shared certainly remove numbers, but I also need to keep the letters from alphanumeric words. Commented Mar 25, 2021 at 22:32

1 Answer

You can use transform together with regexp_replace to strip the digits from each word, then use array_remove to drop the empty strings left behind by entries that consisted only of digits. Both transform and array_remove are available as Spark SQL functions from Spark 2.4 onward.

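from pyspark.sql import functions as F

# Strip digits from every element, then drop the elements that became empty.
df2 = df.withColumn(
    'words',
    F.expr("array_remove(transform(words, x -> regexp_replace(x, '[0-9]', '')), '')")
)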

df2.show(truncate=False)
+---+-------------------------------+
|id |words                          |
+---+-------------------------------+
|564|[fhbgtrj, gjhg, ghthghu, fhrfu]|
+---+-------------------------------+
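
To sanity-check the pattern without a Spark session, the same two steps (strip digits, then drop empty strings) can be sketched in plain Python. This is only an illustration of the logic, not part of the Spark solution; the helper name strip_digits is hypothetical.

```python
import re

# Same character class as in the regexp_replace call above.
DIGITS = re.compile(r"[0-9]")

def strip_digits(words):
    """Remove digits from each word, then drop words that became empty."""
    stripped = [DIGITS.sub("", w) for w in words]
    return [w for w in stripped if w != ""]

print(strip_digits(["fhbgtrj5", "345gjhg", "ghth578ghu", "5897", "fhrfu44"]))
# ['fhbgtrj', 'gjhg', 'ghthghu', 'fhrfu']
```

Note that '[0-9]' matches ASCII digits only, so other characters such as punctuation or accented letters are left untouched.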