I have a PySpark DataFrame:

number  |  matricule
--------|----------------------------------
1       |  ["AZ 1234", "1234", "00100"]
23      |  ["1010", "12987"]
56      |  ["AZ 98989", "22222", "98989"]

In the matricule array, some values become duplicates once the "AZ" string is removed. I would like to remove the "AZ" string and then remove the duplicate values in the matricule array. Sometimes there is a space right after AZ, and that should be removed as well.

I wrote a UDF:

def remove_AZ(A):
    for item in A:
        if item.startswith('AZ'):
            item.replace('AZ','')
udf_remove_AZ = F.udf(remove_AZ)
df = df.withColumn("AZ_2", udf_remove_AZ(df.matricule))

I get null in every row of the AZ_2 column.

How can I remove the AZ from each value in the matricule array and then remove the duplicates inside it? Thank you.

2 Answers

7

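For Spark 2.4+, you can use the transform and array_distinct functions like this:

from pyspark.sql.functions import array_distinct, expr

# Strip a leading "AZ" from each element, then trim surrounding spaces
t = "transform(matricule, x -> trim(regexp_replace(x, '^AZ', '')))"
df.withColumn("matricule", array_distinct(expr(t))).show(truncate=False)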

#+------+--------------+
#|number|matricule     |
#+------+--------------+
#|1     |[1234, 00100] |
#|23    |[1010, 12987] |
#|56    |[98989, 22222]|
#+------+--------------+

transform applies the expression to each element of the array: regexp_replace removes the AZ characters from the beginning of the string, and trim removes any leading and trailing spaces. array_distinct then removes the duplicates from the resulting array.
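
As a rough cross-check of what the expression computes for each row, the same steps can be mirrored in plain Python without a Spark session. This is only a sketch: clean_row is a hypothetical helper, Python's re.sub stands in for regexp_replace, strip(" ") stands in for trim, and the order-preserving de-duplication assumes array_distinct keeps the first occurrence of each value, as the output above suggests.

```python
import re

def clean_row(values):
    """Mirror transform + regexp_replace + trim + array_distinct for one array."""
    # '^AZ' is anchored, so only a leading "AZ" is removed
    stripped = [re.sub(r"^AZ", "", v).strip(" ") for v in values]
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(stripped))

# Sample rows copied from the question
rows = {
    1: ["AZ 1234", "1234", "00100"],
    23: ["1010", "12987"],
    56: ["AZ 98989", "22222", "98989"],
}
for number, matricule in rows.items():
    print(number, clean_row(matricule))
```

Because the pattern is anchored with ^, a value that merely contains AZ somewhere in the middle is left untouched, which is not the case for a plain string replace.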

4

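You can write your UDF like this:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

def remove_az(array):
    # Remove the "AZ" text and any surrounding spaces from each value
    array = [w.replace('AZ', '').strip() for w in array]
    return array

# Declare the return type; F.udf defaults to StringType otherwise
remove_az_udf = F.udf(remove_az, ArrayType(StringType()))

df = df.withColumn("AZ_2", remove_az_udf(df.matricule))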
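
For context, the UDF in the question appears to return null for two reasons that can be reproduced in plain Python: the function has no return statement, so it returns None, and str.replace returns a new string rather than modifying item in place. The sketch below shows both, along with a hypothetical variant (remove_az_distinct, not part of either answer) that also drops the duplicates, which the UDF above does not do.

```python
def remove_AZ(A):
    # Same body as the question's UDF: the result of replace is discarded
    # and nothing is returned
    for item in A:
        if item.startswith('AZ'):
            item.replace('AZ', '')

def remove_az_distinct(array):
    # Hypothetical variant: strip "AZ" and spaces, then de-duplicate
    # while keeping first-seen order
    if array is None:
        return None
    cleaned = [w.replace('AZ', '').strip() for w in array]
    return list(dict.fromkeys(cleaned))

sample = ["AZ 1234", "1234", "00100"]
print(remove_AZ(sample))           # None, which Spark shows as null
print(remove_az_distinct(sample))  # ['1234', '00100']
```

The variant could be registered with F.udf(remove_az_distinct, ArrayType(StringType())), where ArrayType and StringType come from pyspark.sql.types. On Spark 2.4+ the built-in functions are likely to be faster than a Python UDF, so this is mainly useful on older versions.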
