
I have a PySpark dataframe with a text column.

  1. I wanted to map the values with a regex expression:

     df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-RH', 'RH'))
     df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-FI', 'FI'))
  2. I also wanted to map specific values according to a dictionary. I did the following (`mapper` comes from `create_map()`):

     df = df.withColumn("mapped_col", mapper.getItem(F.col("action")))
  3. Finally, the values that have not been mapped by the dictionary or the regex expression should be set to null. I do not know how to do this part consistently with the other two.

Is it possible to have something like a dictionary of regex expressions, so I can regroup the two 'functions'? {".*-RH": "RH", ".*-FI": "FI"}
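The three mapping steps above (regex dictionary, exact-value dictionary, null fallback) can be sketched in plain Python before translating to PySpark. This is only an illustration of the lookup logic; `map_message`, `regex_mapper`, and `exact_mapper` are hypothetical names:

```python
import re

# Hypothetical mappers: regex patterns and exact values both live in dicts.
regex_mapper = {r".*-RH$": "RH", r".*-FI$": "FI"}
exact_mapper = {"GDF2009": "GDF", "GDF2014": "GDF", "ADS-set": "ADS"}

def map_message(message):
    # 1) exact dictionary lookup first
    if message in exact_mapper:
        return exact_mapper[message]
    # 2) then try each regex pattern in turn
    for pattern, value in regex_mapper.items():
        if re.match(pattern, message):
            return value
    # 3) anything unmatched falls through to None (null)
    return None

print(map_message("dqdazdaapijiejoajojp565656-RH"))  # RH
print(map_message("kijipiadoa"))                     # None
```

In PySpark the same precedence could be expressed by chaining `F.when(...)` clauses (or `F.coalesce`) built from the two dicts, with no `otherwise`, so unmatched rows default to null.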

Original Output Example

+-----------------------------+
|message                      |
+-----------------------------+
|GDF2009                      | 
|GDF2014                      |
|ADS-set                      |
|ADS-set                      |
|XSQXQXQSDZADAA5454546a45a4-FI|
|dadaccpjpifjpsjfefspolamml-FI|
|dqdazdaapijiejoajojp565656-RH|
|kijipiadoa                   |
+-----------------------------+

Expected Output Example

+-----------------------------+----------+
|message                      |status    |
+-----------------------------+----------+
|GDF2009                      |GDF       |
|GDF2014                      |GDF       |
|ADS-set                      |ADS       |
|ADS-set                      |ADS       |
|XSQXQXQSDZADAA5454546a45a4-FI|FI        |
|dadaccpjpifjpsjfefspolamml-FI|FI        |
|dqdazdaapijiejoajojp565656-RH|RH        |
|kijipiadoa                   |null or ??|
+-----------------------------+----------+

So the first four lines are mapped with a dict, and the others are mapped using regex. Unmapped values become null or ??. Thank you.

  • Could you add input and expected output dataframe? Commented Jul 7, 2020 at 11:05
  • I have edited my post, I hope it will help you Commented Jul 7, 2020 at 11:22

1 Answer


You can achieve it using the `contains` function:

from pyspark.sql import functions as f
from pyspark.sql.types import StringType

df = spark.createDataFrame(
    ["GDF2009", "GDF2014", "ADS-set", "ADS-set", "XSQXQXQSDZADAA5454546a45a4-FI", "dadaccpjpifjpsjfefspolamml-FI",
     "dqdazdaapijiejoajojp565656-RH", "kijipiadoa"], StringType()).toDF("message")
df.show()

names = ("GDF", "ADS", "FI", "RH")

# For each name, emit the name when the message contains it, otherwise an empty string
def c(col, names):
    return [f.when(f.col(col).contains(i), i).otherwise("") for i in names]

# Drop the empty strings and concatenate the remaining matches into one status column
df.select("message", f.concat_ws("", f.array_remove(f.array(*c("message", names)), "")).alias("status")).show()

output:

+--------------------+
|             message|
+--------------------+
|             GDF2009|
|             GDF2014|
|             ADS-set|
|             ADS-set|
|XSQXQXQSDZADAA545...|
|dadaccpjpifjpsjfe...|
|dqdazdaapijiejoaj...|
|          kijipiadoa|
+--------------------+

+--------------------+------+
|             message|status|
+--------------------+------+
|             GDF2009|   GDF|
|             GDF2014|   GDF|
|             ADS-set|   ADS|
|             ADS-set|   ADS|
|XSQXQXQSDZADAA545...|    FI|
|dadaccpjpifjpsjfe...|    FI|
|dqdazdaapijiejoaj...|    RH|
|          kijipiadoa|      |
+--------------------+------+
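Note that this leaves unmapped rows as an empty string, while the question asked for null. The matching logic, with the empty result converted to None, can be sketched in plain Python (the PySpark equivalent would be an extra `f.when(...)` wrapper around the status column; `status_of` is an illustrative name):

```python
names = ("GDF", "ADS", "FI", "RH")

def status_of(message):
    # Mirror of the answer's approach: keep every name contained in the
    # message, concatenate them, and treat an empty result as None (null).
    status = "".join(n for n in names if n in message)
    return status or None

print(status_of("GDF2009"))     # GDF
print(status_of("kijipiadoa"))  # None
```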