Replace a comma only if it is followed by an integer in a PySpark column

Question

values = [("3", "100;PerMonth;BB;1500;Tm;TkU,2500;Trm;TU"), ("4", "100;CalendarDay;g;440;Term;Degram")]
df = spark.createDataFrame(values, ["id", "derivate"])

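I want to change a comma to a pipe inside a PySpark column, but only if the comma comes immediately before an integer. In the expected output, a semicolon that comes before an integer is replaced in the same way.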

input

+---+---------------------------------------+
|id |derivate                               |
+---+---------------------------------------+
|3  |100;PerMonth;BB;1500;Tm;TkU,2500;Trm;TU|
|4  |100;CalendarDay;g;440;Term;Degram      |
+---+---------------------------------------+

expected output

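+---+---------------------------------------+---------------------------------------+
|id |derivate                               |ITEMS                                  |
+---+---------------------------------------+---------------------------------------+
|3  |100;PerMonth;BB;1500;Tm;TkU,2500;Trm;TU|100;PerMonth;BB|1500;Tm;TkU|2500;Trm;TU|
|4  |100;CalendarDay;g;440;Term;Degram      |100;CalendarDay;g|440;Term;Degram      |
+---+---------------------------------------+---------------------------------------+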

blackbishop · Accepted Answer · 2022-02-22 15:25:15Z


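You can use the regexp_replace function with the regex [;,](?=\d+) to match every comma or semicolon that is immediately followed by a digit. The lookahead (?=\d+) only checks that a digit follows and does not consume it, so the number itself is left untouched (a single \d inside the lookahead would be enough):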

from pyspark.sql import functions as F

df.withColumn(
    "ITEMS",
    F.regexp_replace(F.col("derivate"), "[;,](?=\\d+)", "|")
).show(truncate=False)

#+---+---------------------------------------+---------------------------------------+
#|id |derivate                               |ITEMS                                  |
#+---+---------------------------------------+---------------------------------------+
#|3  |100;PerMonth;BB;1500;Tm;TkU,2500;Trm;TU|100;PerMonth;BB|1500;Tm;TkU|2500;Trm;TU|
#|4  |100;CalendarDay;g;440;Term;Degram      |100;CalendarDay;g|440;Term;Degram      |
#+---+---------------------------------------+---------------------------------------+
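
To sanity-check the pattern without a Spark session, the same lookahead can be tried with Python's built-in re module. This is only a sketch: the helper name to_items is hypothetical, and it assumes Python's re and the Java regex engine used by Spark treat this simple pattern the same way for ASCII input like the sample rows.

```python
import re

# Same idea as the pattern in the answer: a ';' or ',' that is
# immediately followed by a digit. The lookahead (?=\d) checks for
# the digit without consuming it, so only the delimiter is replaced.
PATTERN = re.compile(r"[;,](?=\d)")


def to_items(derivate: str) -> str:
    """Replace each ';' or ',' that precedes a digit with '|'."""
    return PATTERN.sub("|", derivate)


rows = [
    "100;PerMonth;BB;1500;Tm;TkU,2500;Trm;TU",
    "100;CalendarDay;g;440;Term;Degram",
]
for row in rows:
    print(to_items(row))
# 100;PerMonth;BB|1500;Tm;TkU|2500;Trm;TU
# 100;CalendarDay;g|440;Term;Degram
```

Delimiters that precede a letter (for example the ';' before PerMonth) are left alone because the lookahead fails there.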


1 Comment

gowtham natrajan Over a year ago

Thanks, but can I add a | instead of replacing?
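
One possible reading of this comment is that the original ';' or ',' should be kept and a '|' inserted next to it. A minimal sketch of that idea, again in plain Python re rather than Spark: the pattern below matches the empty position between a delimiter and a digit, so substituting "|" inserts a pipe without removing anything. The helper name add_pipe is hypothetical, and placing the pipe after the delimiter (rather than before it) is an assumption about what the commenter wants. The same pattern string should in principle be usable as the pattern argument of F.regexp_replace, since Java regex also supports lookbehind, but that is not verified here.

```python
import re

# Zero-width match: the position right after a ';' or ',' (lookbehind)
# and right before a digit (lookahead). Nothing is consumed, so
# substituting "|" inserts a pipe and keeps the original delimiter.
INSERT_AT = re.compile(r"(?<=[;,])(?=\d)")


def add_pipe(derivate: str) -> str:
    """Insert '|' between a ';' or ',' and the digit that follows it."""
    return INSERT_AT.sub("|", derivate)


print(add_pipe("100;PerMonth;BB;1500;Tm;TkU,2500;Trm;TU"))
# 100;PerMonth;BB;|1500;Tm;TkU,|2500;Trm;TU
print(add_pipe("100;CalendarDay;g;440;Term;Degram"))
# 100;CalendarDay;g;|440;Term;Degram
```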
