
I need to update a column in a PySpark DataFrame if it contains a certain substring.

For example, df looks like:

id      address
1       spring-field_garden
2       spring-field_lane
3       new_berry place

If the address column contains spring-field_, the whole value should be replaced with spring-field.

Expected result:

id      address
1       spring-field
2       spring-field
3       new_berry place

I tried:

df = df.withColumn('address',F.regexp_replace(F.col('address'), 'spring-field_*', 'spring-field'))

It doesn't seem to work.
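The likely reason: in a regex, _* means "zero or more underscores", not "underscore followed by anything", so regexp_replace replaces only the matched spring-field_ part and leaves the rest of the string in place. Spark's regexp_replace uses Java regexes, but plain Python re behaves the same way for this pattern, so the effect can be sketched without Spark:

```python
import re

# 'spring-field_*' matches 'spring-field' plus zero or more underscores.
# Only that matched portion is replaced, so the suffix survives.
result = re.sub(r'spring-field_*', 'spring-field', 'spring-field_garden')
print(result)  # -> 'spring-fieldgarden', not 'spring-field'
```

So the replacement does run; it just does not consume the trailing text after the underscore.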

2 Answers


You can use like with a when expression:

from pyspark.sql import functions as F

df = df.withColumn(
    'address',
    F.when(
        F.col('address').like('%spring-field_%'),
        F.lit('spring-field')
    ).otherwise(F.col('address'))
)

You can use the following regex:

df.withColumn(
    'address',
    F.regexp_replace('address', r'.*spring-field.*', 'spring-field')
)

Alternatively, you can use the contains method:

df.withColumn(
    'address',
    F.when(
        F.col('address').contains("spring-field"), "spring-field"
    ).otherwise(F.col('address'))
)
