
I have two dataframes that I would like to join.

One dataframe looks like this, where syscode_ntwrk is separated by a dash:

spark.createDataFrame(
    [
        (1, '1234 - ESPN'), 
        (2, '1234 - ESPN'),
        (3, '963 - CNN'), 
        (4, '963 - CNN'),
    ],
    ['id', 'col1'] 
)

And the other is in this format, where syscode and ntwrk are concatenated together with no separator:

spark.createDataFrame(
    [
        (100, '1234ESPN'), 
        (297, '1234ESPN'),
        (3989, '963CNN'), 
        (478, '963CNN'),
    ],
    ['counts', 'col1'] 
)

Is there a way to create a new column in the second dataframe that matches the syscode_ntwrk format of the first? The syscode will always be a group of digits and the ntwrk will always be a group of letters, so is there a regex that inserts " - " between the two?

1 Answer

You can use regexp_extract to pull out the digit and letter groups, then concat_ws to join them with " - ".

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        (100, '1234ESPN'), 
        (297, '1234ESPN'),
        (3989, '963CNN'), 
        (478, '963CNN'),
    ],
    ['counts', 'col1'] 
)

df.select(
    F.concat_ws(
        ' - ',
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 1),  # digit group (syscode)
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 2)   # letter group (ntwrk)
    ).alias('parsed')
).show()

+-----------+
|     parsed|
+-----------+
|1234 - ESPN|
|1234 - ESPN|
|  963 - CNN|
|  963 - CNN|
+-----------+
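
Since the end goal is a join, you could attach the reformatted value as a new column and join on it. Here's a minimal sketch; the DataFrame names df1/df2, the column name syscode_ntwrk, and the inner join are my assumptions, not from the question:

import pyspark.sql.functions as F

# df1 holds the dash-separated values, df2 the concatenated ones
# (names are illustrative).
df1 = spark.createDataFrame(
    [(1, '1234 - ESPN'), (2, '1234 - ESPN'), (3, '963 - CNN'), (4, '963 - CNN')],
    ['id', 'col1']
)
df2 = spark.createDataFrame(
    [(100, '1234ESPN'), (297, '1234ESPN'), (3989, '963CNN'), (478, '963CNN')],
    ['counts', 'col1']
)

# Add the reformatted key to df2, then join on it.
df2_keyed = df2.withColumn(
    'syscode_ntwrk',
    F.concat_ws(
        ' - ',
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 1),
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 2)
    )
)

joined = df1.join(df2_keyed, df1.col1 == df2_keyed.syscode_ntwrk, 'inner')
joined.show()

If you prefer a single expression, regexp_replace with backreferences should also work, e.g. F.regexp_replace('col1', r'(\d+)([a-zA-Z]+)', '$1 - $2'), since Spark uses Java-style $1 replacement groups.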