I have a PySpark DataFrame that contains 4 columns. I want to extract some strings from one column, whose type is array of strings.
I tried the regexp_extract function, but it returned an error because regexp_extract only accepts a string column.
Example DataFrame:
id | last_name | age | Identificator
------------------------------------------------------------------
12 | AA | 23 | "[""AZE","POI","76759","T86420","ADAPT"]"
------------------------------------------------------------------
24 | BB | 24 | "[""SDN","34","35","AZE","21054","20126"]"
------------------------------------------------------------------
I want to extract all numbers that:
- contain 4, 5 or 6 digits
- are not attached to a letter
- except the letter Z: a number attached to Z should still be extracted
The extracted numbers should be saved in a new column of my DataFrame.
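To make the expected result concrete, here is a minimal plain-Python sketch (no Spark involved) of the rules above, applied to the two example rows. The helper name extract_numbers is hypothetical, and reading "attached to Z" as one optional leading or trailing Z on an array element is my assumption.

```python
import re

# Assumption: "attached to Z" means a single optional leading or trailing "Z".
# Any other letter, or fewer than 4 / more than 6 digits, rejects the element.
PATTERN = re.compile(r"Z?(\d{4,6})Z?")


def extract_numbers(items):
    """Return the 4- to 6-digit numbers found in a list of strings."""
    result = []
    for item in items:
        # fullmatch requires the whole element to fit the pattern
        match = PATTERN.fullmatch(item)
        if match:
            result.append(match.group(1))
    return result


row_1 = ["AZE", "POI", "76759", "T86420", "ADAPT"]
row_2 = ["SDN", "34", "35", "AZE", "21054", "20126"]

print(extract_numbers(row_1))  # ['76759']
print(extract_numbers(row_2))  # ['21054', '20126']
```

So for id 12 I expect only 76759 (T86420 is attached to a T), and for id 24 I expect 21054 and 20126.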
I started like this, but it doesn't work because the Identificator column is an array of strings:
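from pyspark.sql import functions as F
expression = r'([0-9]{4,6})'  # group 1 captures the whole run of digits
# Fails: Identificator is array<string>, but regexp_extract expects a string
df = df.withColumn("extract", F.regexp_extract(F.col("Identificator"), expression, 1))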
How can I extract these numbers using regexp_extract or another solution? Thank you.