
How can I select the characters (the rest of the file path) after Dev\ or dev\ from a column in a PySpark DataFrame?

Sample rows of the pyspark column:

\\D\Dev\johnny\Desktop\TEST
\\D\Dev\matt\Desktop\TEST\NEW
\\D\Dev\matt\Desktop\TEST\OLD\TEST
\\E\dev\peter\Desktop\RUN\SUBFOLDER\New

Expected Output

johnny\Desktop\TEST
matt\Desktop\TEST\NEW
matt\Desktop\TEST\OLD\TEST
peter\Desktop\RUN\SUBFOLDER\New

I tried to use the code below.

from pyspark.sql import functions as F

# Split on "Dev\" and keep everything after the last match.
df = df.withColumn(
    "sub_path",
    F.element_at(F.split(F.col("path"), "Dev\\\\"), -1)
)

It's only giving partially correct results. I'd appreciate it if someone could help.

1 Answer


The following modification uses the character class [Dd] in the split pattern, which matches both upper- and lower-case d.

df = df.withColumn(
    "sub_path",
    F.element_at(F.split(F.col("path"), "[Dd]ev\\\\"), -1)
)
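
For reference, here is a minimal, self-contained sketch (assuming a local SparkSession and the column name "path" from the question's code) that applies this split to the sample rows:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sample rows from the question.
df = spark.createDataFrame(
    [
        (r"\\D\Dev\johnny\Desktop\TEST",),
        (r"\\D\Dev\matt\Desktop\TEST\NEW",),
        (r"\\D\Dev\matt\Desktop\TEST\OLD\TEST",),
        (r"\\E\dev\peter\Desktop\RUN\SUBFOLDER\New",),
    ],
    ["path"],
)

# Split on "Dev\" or "dev\" and keep everything after the last match.
df = df.withColumn(
    "sub_path",
    F.element_at(F.split(F.col("path"), "[Dd]ev\\\\"), -1),
)

df.select("sub_path").show(truncate=False)
# johnny\Desktop\TEST
# matt\Desktop\TEST\NEW
# matt\Desktop\TEST\OLD\TEST
# peter\Desktop\RUN\SUBFOLDER\New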

Let me know if this works for you.


3 Comments

Thank you for the answer. Is there a way to choose by the number of backslashes, i.e. take the rest of the string after the 4th backslash? We have a large number of rows with different characters.
Please add this as another question with sample data and expected results so that we can test possible solutions and look at it as well.
I have added a new question: stackoverflow.com/questions/69024095/… Thank you for the support.
