How to use regexp_extract on an Array

Question

I have a PySpark DataFrame that contains 4 columns. I want to extract some strings from one column whose type is array of strings. I used the regexp_extract function, but it returned an error because regexp_extract accepts only strings.

example dataframe:

id | last_name | age | Identificator
-------------------------------------------------------------------
12 | AA        | 23  | ["AZE","POI","76759","T86420","ADAPT"]
24 | BB        | 24  | ["SDN","34","35","AZE","21054","20126"]

I want to extract all numbers that:

- contain 4, 5 or 6 digits
- are not attached to letters
- except that a number attached to the letter Z is fine and should be extracted

The extracted values should be saved in a new column in my DataFrame.

I started like this, but it doesn't work because the column is an array of strings.

expression = r'([0-9]){4,6}'
df = df.withColumn("extract", F.regexp_extract(F.col("Identificator"), expression, 1))

How can I extract these numbers using regexp_extract or another solution? Thank you.

jxc · Accepted Answer · 2019-10-14 14:52:42Z

8

Here is what I can do using SparkSQL 2.4.0+ builtin function filter:

from pyspark.sql.functions import expr

df.withColumn('text_new', expr('filter(text, x -> x rlike "^Z?[0-9]{4,6}$")')) \
  .show(truncate=False)
#+-----------------------------------+---------------------+
#|text                               |text_new             |
#+-----------------------------------+---------------------+
#|[AZE, POI, 76759, T86420, ADAPT]   |[76759]              |
#|[SDN, 34, Z8735, AZE, 21054, 20126]|[Z8735, 21054, 20126]|
#+-----------------------------------+---------------------+

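The result is an array containing the matched items. The regex ^Z?[0-9]{4,6}$ matches 4 to 6 digits, optionally preceded by the character 'Z'. The ^ and $ anchors require the whole item to match, so items such as T86420 are rejected.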
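The pattern can be checked outside Spark with Python's re module. This is only a minimal sketch: it assumes Python's regex semantics agree with Spark's rlike for this simple pattern, and keep_matching is a hypothetical helper name, not part of any Spark API.

```python
import re

# Same pattern as in the filter expression: optional 'Z', then 4 to 6 digits.
PATTERN = re.compile(r"^Z?[0-9]{4,6}$")

# Sample arrays copied from the output table in this answer.
rows = [
    ["AZE", "POI", "76759", "T86420", "ADAPT"],
    ["SDN", "34", "Z8735", "AZE", "21054", "20126"],
]


def keep_matching(items):
    """Return only the items that match PATTERN in full (hypothetical helper)."""
    return [x for x in items if PATTERN.match(x)]


filtered = [keep_matching(row) for row in rows]

for original, kept in zip(rows, filtered):
    print(original, "->", kept)
```

Items such as T86420 and 34 are dropped because the anchors force the entire string to match, not just a substring.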

Edit: for older versions of Apache Spark, use udf():

import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

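# regex pattern: optional leading 'Z' followed by 4 to 6 digits
ptn = re.compile(r'^Z?[0-9]{4,6}$')

# create a udf that keeps only the matching array items;
# non-list values (e.g. None) are returned unchanged
array_filter = udf(lambda arr: [x for x in arr if ptn.match(x)] if isinstance(arr, list) else arr, ArrayType(StringType()))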

df.withColumn('text_new', array_filter('text')) \
  .show(truncate=False)

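Edit-2: based on your comment, to change the prefix from 'Z' to 'MOD' and remove the leading MOD from the results, use lstrip(). Note that lstrip('MOD') strips any leading M, O or D characters rather than the literal substring; that is safe here because the regex guarantees that only digits follow the prefix. Adjust the code as follows: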

ptn = re.compile(r'^(?:MOD)?[0-9]{4,6}$')

array_filter = udf(lambda arr: [ x.lstrip('MOD') for x in arr if re.match(ptn, x) ] if type(arr) is list else arr, ArrayType(StringType()))
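An alternative to lstrip() is to capture the digits with a regex group, which avoids relying on str.lstrip treating its argument as a set of characters. The following is a sketch in plain Python so that it can be tested without a cluster; filter_and_strip is a hypothetical name, and the isinstance guard on each item is an added assumption about how null elements should be treated.

```python
import re

# Optional 'MOD' prefix (not captured), then 4 to 6 digits (captured as group 1).
PTN = re.compile(r"^(?:MOD)?([0-9]{4,6})$")


def filter_and_strip(arr):
    """Keep matching items and return only their digits (hypothetical helper).

    Non-list input (e.g. None) is returned unchanged, as in the udf in this answer.
    Non-string items inside the list are skipped (an assumption).
    """
    if not isinstance(arr, list):
        return arr
    matches = (PTN.match(x) for x in arr if isinstance(x, str))
    return [m.group(1) for m in matches if m]


sample = ["MOD1234", "AZE", "T86420", "20126", "MOD12", None]
result = filter_and_strip(sample)
print(result)
```

The function could be wrapped the same way as the lambda in this answer, for example udf(filter_and_strip, ArrayType(StringType())).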

edited Oct 14, 2019 at 14:52

answered Oct 14, 2019 at 12:05


9 Comments

Steven Over a year ago

@verojoucla the latest version is 2.4.4!

jxc Over a year ago

@verojoucla can you test this on a Spark 2.4.4 cluster? It would be more practical to test the code in an environment that reflects your production setup. :) Spark 3.0 is unstable, with many uncertainties.

jxc Over a year ago

did you mean: spark 2.1.1 with Scala 2.11? it's an old version of Spark, right? if so, i would use a udf for this task.

jxc Over a year ago

@verojoucla I will correct the word from 2.40 to 2.4.0 in my post. I think this might raise some misunderstanding.

jxc Over a year ago

that means, some of the column values are not array, most likely None. how do you want to process such cases?
