
Pyspark: Split and select part of the string column values

How can I select the characters or file path after the 4th (from the left) backslash from a column in a Spark DataFrame?

Sample rows of the pyspark column:

\\D\Dev\johnny\Desktop\TEST
\\D\Dev\matt\Desktop\TEST\NEW
\\D\Dev\matt\Desktop\TEST\OLD\TEST
\\E\dev\peter\Desktop\RUN\SUBFOLDER\New
\\K924\prod\ums\Desktop\RUN\SUBFOLDER\New
\\LE345\jskx\rfk\Desktop\RUN\SUBFOLDER\New
.
.
.
\\ls53\f7sn3\vso\hsk\mwq\sdsf\kse

Expected Output

johnny\Desktop\TEST
matt\Desktop\TEST\NEW
matt\Desktop\TEST\OLD\TEST
peter\Desktop\RUN\SUBFOLDER\New
ums\Desktop\RUN\SUBFOLDER\New
rfk\Desktop\RUN\SUBFOLDER\New
.
.
.
vso\hsk\mwq\sdsf\kse

My previous question led to this new question. Appreciate any help.

1 Answer


You may use a regular expression with regexp_replace, e.g.:

from pyspark.sql import functions as F

df = df.withColumn('sub_path', F.regexp_replace("path", "^\\\\\\\\[a-zA-Z0-9]+\\\\[a-zA-Z0-9]+\\\\", ""))
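As a quick local sanity check of the pattern idea (this uses Python's re module rather than Spark, since Java and Python regex syntax agree for these tokens, and a hypothetical regexp_extract variant with a [^\\] class instead of [a-zA-Z0-9], which also tolerates segments containing underscores or other characters):

```python
import re

# Two leading backslashes, then two path segments, then capture the rest.
PATTERN = r"^\\\\[^\\]+\\[^\\]+\\(.*)"

samples = {
    r"\\D\Dev\johnny\Desktop\TEST": r"johnny\Desktop\TEST",
    r"\\E\dev\peter\Desktop\RUN\SUBFOLDER\New": r"peter\Desktop\RUN\SUBFOLDER\New",
}
for path, expected in samples.items():
    m = re.match(PATTERN, path)
    assert m is not None and m.group(1) == expected

# The equivalent Spark call would be (column names 'path'/'sub_path' assumed):
# df = df.withColumn("sub_path", F.regexp_extract("path", PATTERN, 1))
```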

You can also make this solution more flexible, e.g.:

from pyspark.sql import functions as F

no_of_slashes = 4  # number of slashes to consider here

# We build the regular expression by repeating "[a-zA-Z0-9]+\\\\".
# NB: we subtract 2 since the pattern starts with the first 2 slashes.
df = df.withColumn('sub_path', F.regexp_replace("path", "^\\\\\\\\" + ("[a-zA-Z0-9]+\\\\" * (no_of_slashes - 2)), ""))
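The pattern construction can be sanity-checked locally with Python's re module (the escaping rules for these tokens match those of the Java regex engine Spark uses), without starting a Spark session:

```python
import re

no_of_slashes = 4  # number of slashes to consider, as in the Spark version

# Same construction as above: "^\\\\\\\\" is a regex for the two leading
# backslashes, and each repetition consumes one segment plus one backslash.
pattern = "^\\\\\\\\" + "[a-zA-Z0-9]+\\\\" * (no_of_slashes - 2)

result = re.sub(pattern, "", r"\\D\Dev\matt\Desktop\TEST\NEW")
assert result == r"matt\Desktop\TEST\NEW"
```

Note that the [a-zA-Z0-9] class only matches alphanumeric segments; if the server or share names may contain other characters (underscores, hyphens), [^\\] is a safer choice.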

Let me know if this works for you.
