
I need to update a column in a PySpark DataFrame if it contains a certain substring.

For example, df looks like:

id      address
1       spring-field_garden
2       spring-field_lane
3       new_berry place

If the address column contains spring-field_, the whole value should be replaced with spring-field.

Expected result:

id      address
1       spring-field
2       spring-field
3       new_berry place

I tried:

df = df.withColumn('address',F.regexp_replace(F.col('address'), 'spring-field_*', 'spring-field'))

It doesn't seem to work.
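The likely reason: in a regex, _* means "zero or more underscores", not "underscore followed by anything", so regexp_replace replaces only the matched spring-field_ part and leaves the rest of the string in place. Spark's regexp_replace uses Java regexes, but plain Python re behaves the same way for this pattern, so the effect can be sketched without Spark:

```python
import re

# 'spring-field_*' matches 'spring-field' plus zero or more underscores.
# Only that matched portion is replaced, so the suffix survives.
result = re.sub(r'spring-field_*', 'spring-field', 'spring-field_garden')
print(result)  # -> 'spring-fieldgarden', not 'spring-field'
```

So the replacement does run; it just does not consume the trailing text after the underscore.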

2 Answers


You can use like with a when expression:

from pyspark.sql import functions as F

df = df.withColumn(
    'address',
    F.when(
        F.col('address').like('%spring-field_%'),
        F.lit('spring-field')
    ).otherwise(F.col('address'))
)

You can use the following regex:

df.withColumn(
    'address',
    F.regexp_replace('address', r'.*spring-field.*', 'spring-field')
)

Alternatively, you can use the contains method:

df.withColumn(
    'address',
    F.when(
        F.col('address').contains("spring-field"), "spring-field"
    ).otherwise(F.col('address'))
)
