I have a PySpark DataFrame that contains 4 columns. I want to extract some strings from one column whose type is array of strings. I tried the regexp_extract function, but it returned an error because regexp_extract accepts only strings.

Example DataFrame:

id |  last_name | age | Identificator
------------------------------------------------------------------
12 | AA         | 23  |  "[""AZE","POI","76759","T86420","ADAPT"]"
------------------------------------------------------------------
24 | BB         | 24  | "[""SDN","34","35","AZE","21054","20126"]"
------------------------------------------------------------------

I want to extract all numbers that:

- contain 4, 5 or 6 digits
- are not attached to letters
- may be attached to the letter Z; in that case I should still extract them

and save the result in a new column in my DataFrame.

I started like this, but it doesn't work because the column is an array of strings.

import pyspark.sql.functions as F

expression = r'([0-9]){4,6}'
df = df.withColumn("extract", F.regexp_extract(F.col("Identificator"), expression, 1))

How can I extract these numbers using regexp_extract or another solution? Thank you.

1 Answer

Here is one way to do it, using the built-in filter function available in Spark SQL 2.4.0+:

from pyspark.sql.functions import expr

df.withColumn('text_new', expr('filter(text, x -> x rlike "^Z?[0-9]{4,6}$")')) \
  .show(truncate=False)
#+-----------------------------------+---------------------+
#|text                               |text_new             |
#+-----------------------------------+---------------------+
#|[AZE, POI, 76759, T86420, ADAPT]   |[76759]              |
#|[SDN, 34, Z8735, AZE, 21054, 20126]|[Z8735, 21054, 20126]|
#+-----------------------------------+---------------------+

The result is an array containing the matched items. The regex ^Z?[0-9]{4,6}$ matches 4 to 6 digits, optionally preceded by the character 'Z'.
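
To sanity-check the pattern outside Spark, here is a minimal sketch using Python's re module on the two sample rows from the output above. It assumes that Python's re and Spark's Java regex engine behave the same for this simple pattern, which is worth verifying for anything more complex. The names rows and keep are illustrative only.

```python
import re

# Same pattern as in the Spark expression: optional leading 'Z', then 4 to 6 digits.
ptn = re.compile(r'^Z?[0-9]{4,6}$')

# Sample arrays copied from the output table above.
rows = [
    ['AZE', 'POI', '76759', 'T86420', 'ADAPT'],
    ['SDN', '34', 'Z8735', 'AZE', '21054', '20126'],
]

# Keep only the items that match the whole pattern.
keep = [[x for x in row if ptn.match(x)] for row in rows]
print(keep)
```

Because the pattern is anchored with ^ and $, an item such as T86420 is rejected as a whole instead of having its digits pulled out.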

Edit: for older versions of Apache Spark, use udf():

import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# regex pattern:
ptn = re.compile('^Z?[0-9]{4,6}$')

# create a udf that keeps only the array items matching the pattern
array_filter = udf(lambda arr: [ x for x in arr if re.match(ptn, x) ] if type(arr) is list else arr, ArrayType(StringType()))

df.withColumn('text_new', array_filter('text')) \
  .show(truncate=False)
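
The lambda above can also be written as a named function, which makes the handling of non-array values (such as None) explicit and lets you test the logic without a Spark session. This is only a sketch: filter_ids is a hypothetical name, and the extra check that skips non-string items is an assumption that goes beyond the original lambda. The function would then be wrapped with udf(filter_ids, ArrayType(StringType())) in the same way as above.

```python
import re

# Optional leading 'Z', then 4 to 6 digits, matching the whole item.
ptn = re.compile(r'^Z?[0-9]{4,6}$')

def filter_ids(arr):
    """Return the items of arr that match ptn; pass non-list values through unchanged."""
    if not isinstance(arr, list):
        # Covers None (null column values), as the original lambda does.
        return arr
    # Assumption: skip non-string items (e.g. null array elements) instead of failing.
    return [x for x in arr if isinstance(x, str) and ptn.match(x)]
```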

Edit-2: based on your comment, to change the prefix from 'Z' to 'MOD' and remove the leading MOD from the results, use lstrip() on each matched item. Adjust the code as follows:

ptn = re.compile(r'^(?:MOD)?[0-9]{4,6}$')

array_filter = udf(lambda arr: [ x.lstrip('MOD') for x in arr if re.match(ptn, x) ] if type(arr) is list else arr, ArrayType(StringType()))
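
One caveat: str.lstrip('MOD') removes any leading run of the characters M, O and D, not the literal prefix 'MOD'. It gives the right result here only because every matched item is an optional 'MOD' followed by digits. If the pattern is loosened later, a capture group avoids the pitfall. The sketch below shows that alternative; strip_mod is a hypothetical helper name, and it assumes every array item is a string.

```python
import re

# Capture the digits so the optional 'MOD' prefix is dropped by construction.
ptn = re.compile(r'^(?:MOD)?([0-9]{4,6})$')

def strip_mod(arr):
    """Return the digit part of each matching item; pass non-list values through unchanged."""
    if not isinstance(arr, list):
        return arr
    matches = (ptn.match(x) for x in arr)
    return [m.group(1) for m in matches if m]

print(strip_mod(['MOD12345', '6789', 'DOM1234', 'AZE']))  # ['12345', '6789']
print('DOM1234'.lstrip('MOD'))  # '1234': lstrip treats 'MOD' as a set of characters
```

As with the earlier version, this function can be passed to udf() with ArrayType(StringType()) as the return type.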

9 Comments

@verojoucla the latest version is 2.4.4!
@verojoucla can you test this on a Spark 2.4.4 cluster? It would be more practical to test the code in an environment that reflects your production setup. :) Spark 3.0 is still unstable, with many uncertainties.
Did you mean Spark 2.1.1 with Scala 2.11? That's an old version of Spark, right? If so, I would use a udf for this task.
@verojoucla I will correct 2.40 to 2.4.0 in my post. I think this might have caused some misunderstanding.
That means some of the column values are not arrays, most likely None. How do you want to process such cases?