I have the following DataFrame with a column sig and N other columns.

Each sig value has the names of some of those columns embedded in it, as shown below. Any number of the DataFrame's column names can appear in a sig value.

I want to update the sig column by replacing each embedded column name with the corresponding value from that column in the same row.

For example,

+---------------------------------------------------------------------+------------+------------------+-------------------+--------+
|sig                                                                  |order_timing|po_manl_create_ind|mabd_arrival_status|cut_time|
+---------------------------------------------------------------------+------------+------------------+-------------------+--------+
|R1:BR1-order_timing:BR2-po_manl_create_ind:BR3-mabd_arrival_status:R1|14          |0                 |late               |23      |
|R1:BR1-order_timing:BR2-po_manl_create_ind:BR7-cut_time:R1           |14          |0                 |on_time            |10      |
+---------------------------------------------------------------------+------------+------------------+-------------------+--------+

Expected output

+---------------------------+------------+------------------+-------------------+--------+
|sig                        |order_timing|po_manl_create_ind|mabd_arrival_status|cut_time|
+---------------------------+------------+------------------+-------------------+--------+
|R1:BR1-14:BR2-0:BR3-late:R1|14          |0                 |late               |23      |
|R1:BR1-14:BR2-0:BR7-10:R1  |14          |0                 |on_time            |10      |
+---------------------------+------------+------------------+-------------------+--------+

1 Answer

One way is to chain replace expressions, one for each column whose name may appear in the sig values.

Using this sample DF:

import spark.implicits._ // enables toDF; assumes a SparkSession named spark (already in scope in spark-shell)

val df = Seq(
   ("R1:BR1-order_timing:BR2-po_manl_create_ind:BR3-mabd_arrival_status:R1", 14, 0, "late", 23),
   ("R1:BR1-order_timing:BR2-po_manl_create_ind:BR7-cut_time:R1", 14, 0, "on_time", 10)
).toDF("sig", "order_timing", "po_manl_create_ind", "mabd_arrival_status", "cut_time")

You can generate the replacement expression replace_expr using foldLeft like this:

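import org.apache.spark.sql.functions.expr // needed outside spark-shell

// Start from the column name "sig" and wrap it in one replace(...) call per other column,
// producing a nested SQL expression string.
val replace_expr = df.columns
  .filter(_ != "sig")
  .foldLeft("sig")((acc, c) => s"replace($acc, '$c', $c)")

// Parse the string as a SQL expression and overwrite sig with its result.
df.withColumn("sig", expr(replace_expr)).show(false)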

//+---------------------------+------------+------------------+-------------------+--------+
//|sig                        |order_timing|po_manl_create_ind|mabd_arrival_status|cut_time|
//+---------------------------+------------+------------------+-------------------+--------+
//|R1:BR1-14:BR2-0:BR3-late:R1|14          |0                 |late               |23      |
//|R1:BR1-14:BR2-0:BR7-10:R1  |14          |0                 |on_time            |10      |
//+---------------------------+------------+------------------+-------------------+--------+
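
To see what the foldLeft actually builds, here is a minimal sketch in plain Scala (no Spark needed) that produces the same expression string and prints it. The object name ReplaceExprDemo and the helper buildReplaceExpr are hypothetical names chosen for illustration; the column list is copied from the sample DF above.

```scala
object ReplaceExprDemo {
  // Column names copied from the sample DataFrame (what df.columns would return).
  val columns: Seq[String] =
    Seq("sig", "order_timing", "po_manl_create_ind", "mabd_arrival_status", "cut_time")

  // Hypothetical helper: the same fold as in the answer, written as a plain function.
  // Each step wraps the expression built so far in one more replace(...) call.
  def buildReplaceExpr(cols: Seq[String], target: String): String =
    cols
      .filter(_ != target)
      .foldLeft(target)((acc, c) => s"replace($acc, '$c', $c)")

  def main(args: Array[String]): Unit =
    println(buildReplaceExpr(columns, "sig"))
}
```

For the four columns above this prints replace(replace(replace(replace(sig, 'order_timing', order_timing), 'po_manl_create_ind', po_manl_create_ind), 'mabd_arrival_status', mabd_arrival_status), 'cut_time', cut_time), which is the string passed to expr.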

2 Comments

Awesome, looks like magic! The only issue is that the entire sig value becomes NULL if any of the columns contains NULL.
@Shaun you can use the nvl function to handle cases where the replacement value is null. Simply change the expression to s"replace($acc, '$c', nvl($c, ''))"; this substitutes an empty string when the column is null.
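
Following the nvl suggestion in the comment above, a possible null-safe variant of the expression builder might look like the sketch below. It is plain Scala that only builds the string. The names NullSafeReplaceExprDemo and buildNullSafeReplaceExpr are hypothetical, and the explicit cast to string is an addition that is not in the original comment, made so that both nvl arguments have the same type.

```scala
object NullSafeReplaceExprDemo {
  // Hypothetical helper: like the fold in the answer, but a NULL column value
  // is replaced by an empty string instead of turning the whole result into NULL.
  // The cast to string is an assumption added here, not part of the original comment.
  def buildNullSafeReplaceExpr(cols: Seq[String], target: String): String =
    cols
      .filter(_ != target)
      .foldLeft(target)((acc, c) => s"replace($acc, '$c', nvl(cast($c as string), ''))")

  def main(args: Array[String]): Unit =
    println(buildNullSafeReplaceExpr(Seq("sig", "order_timing", "cut_time"), "sig"))
}
```

The resulting string can be passed to expr in the same way as replace_expr in the answer.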
