
I have two dataframes that I would like to join.

One dataframe looks like this, where syscode_ntwrk is separated by a dash:

spark.createDataFrame(
    [
        (1, '1234 - ESPN'), 
        (2, '1234 - ESPN'),
        (3, '963 - CNN'), 
        (4, '963 - CNN'),
    ],
    ['id', 'col1'] 
)

And the other is in this format, where syscode and ntwrk are concatenated together with no separator:

spark.createDataFrame(
    [
        (100, '1234ESPN'), 
        (297, '1234ESPN'),
        (3989, '963CNN'), 
        (478, '963CNN'),
    ],
    ['counts', 'col1'] 
)

Is there a way to create a new column in the second dataframe that matches the syscode_ntwrk format of the first? The syscode will always be a group of digits and the ntwrk will always be a group of letters, so is there a regex that inserts " - " between the two?

1 Answer

You can use regexp_extract to pull out the digit and letter groups, then concat_ws to join them with " - ".

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        (100, '1234ESPN'), 
        (297, '1234ESPN'),
        (3989, '963CNN'), 
        (478, '963CNN'),
    ],
    ['counts', 'col1'] 
)

df.select(
    F.concat_ws(
        ' - ',
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 1),  # digit group (syscode)
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 2)   # letter group (ntwrk)
    ).alias('parsed')
).show()

+-----------+
|     parsed|
+-----------+
|1234 - ESPN|
|1234 - ESPN|
|  963 - CNN|
|  963 - CNN|
+-----------+
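
Since the end goal is a join, you could attach the reformatted value as a new column and join on it. Here's a minimal sketch; the DataFrame names df1/df2, the column name syscode_ntwrk, and the inner join are my assumptions, not from the question:

import pyspark.sql.functions as F

# df1 holds the dash-separated values, df2 the concatenated ones
# (names are illustrative).
df1 = spark.createDataFrame(
    [(1, '1234 - ESPN'), (2, '1234 - ESPN'), (3, '963 - CNN'), (4, '963 - CNN')],
    ['id', 'col1']
)
df2 = spark.createDataFrame(
    [(100, '1234ESPN'), (297, '1234ESPN'), (3989, '963CNN'), (478, '963CNN')],
    ['counts', 'col1']
)

# Add the reformatted key to df2, then join on it.
df2_keyed = df2.withColumn(
    'syscode_ntwrk',
    F.concat_ws(
        ' - ',
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 1),
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 2)
    )
)

joined = df1.join(df2_keyed, df1.col1 == df2_keyed.syscode_ntwrk, 'inner')
joined.show()

If you prefer a single expression, regexp_replace with backreferences should also work, e.g. F.regexp_replace('col1', r'(\d+)([a-zA-Z]+)', '$1 - $2'), since Spark uses Java-style $1 replacement groups.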