I'm trying to join two tables based on a common ID, but there's a mismatch in dates across these files which I'm trying to normalise.

Given this data:

+-------+-------------------+----------------------------+
|dataset|id                 |topic                       |
+-------+-------------------+----------------------------+
|2020A  |1128290566331031552|papuaNewguineaEarthquake2019|
|2020A  |1128293303659716608|papuaNewguineaEarthquake2019|
|2020A  |1152200235847966726|athensEarthquake2019        |
|2020A  |1152204892083281920|athensEarthquake2019        |
|2020A  |1152220394008522753|athensEarthquake2019        |
+-------+-------------------+----------------------------+

How would I, for example, replace the 2019 in papuaNewguineaEarthquake2019 with the first four digits of the value in the dataset column so that:

papuaNewguineaEarthquake2019 becomes papuaNewguineaEarthquake2020?

In other words, how do I use regex to replace a subgroup in one column with a subgroup in another column?

1 Answer

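You can use the expr function, which lets you write the Spark SQL expression directly so that the replacement value can come from another column.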

I'm using regexp_extract to extract the first 4 digits from the dataset column and regexp_replace to replace the last 4 digits of the topic column with the output of regexp_extract.

Regex for first 4 digits: (^[0-9]{4})
Regex for last 4 digits: ([0-9]{4}$)

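from pyspark.sql.functions import expr

# Extract the leading 4-digit year from `dataset` into a temporary column,
# then use it to replace the trailing 4 digits of `topic`.
df.withColumn("dataset_year", expr("regexp_extract(dataset, '(^[0-9]{4})', 1)")) \
    .withColumn("topic", expr("regexp_replace(topic, '([0-9]{4}$)', dataset_year)")) \
    .drop("dataset_year") \
    .show(truncate=False)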

+-------+-------------------+----------------------------+
|dataset|id                 |topic                       |
+-------+-------------------+----------------------------+
|2020A  |1128290566331031552|papuaNewguineaEarthquake2020|
|2020A  |1128293303659716608|papuaNewguineaEarthquake2020|
|2020A  |1152200235847966726|athensEarthquake2020        |
|2020A  |1152204892083281920|athensEarthquake2020        |
|2020A  |1152220394008522753|athensEarthquake2020        |
+-------+-------------------+----------------------------+
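
To sanity-check the two patterns without a Spark session, here is a minimal sketch using Python's built-in re module. The helper name swap_year is hypothetical and only mirrors the extract-then-replace logic above on a single row; it is not part of the Spark API.

```python
import re

# Same patterns as in the Spark expressions above.
FIRST_FOUR = r"(^[0-9]{4})"   # leading 4 digits of `dataset`
LAST_FOUR = r"([0-9]{4}$)"    # trailing 4 digits of `topic`


def swap_year(dataset: str, topic: str) -> str:
    """Replace the trailing year in `topic` with the leading year of `dataset`.

    Hypothetical helper for illustration only. Unlike Spark's regexp_extract,
    which returns an empty string when nothing matches, this sketch leaves
    `topic` unchanged in that case.
    """
    match = re.search(FIRST_FOUR, dataset)
    if match is None:
        return topic
    return re.sub(LAST_FOUR, match.group(1), topic)


print(swap_year("2020A", "papuaNewguineaEarthquake2019"))
```

One thing to watch for in the Spark version: if dataset does not start with four digits, regexp_extract yields an empty string, so the trailing year in topic would simply be removed.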