
I have a table like this:

a | a_vector
1 | 710.83;-776.98;-10.86;2013.02;-896.28;
2 | 3  ; 2  ;  1

Using PySpark/pandas, how do I dynamically create columns so that the first value in the vector goes to "col1", the second to "col2", etc., and also calculate the sum?

a | a_vector       | col1 | col2 | col3
1 | 300;-200;2022; | 300  | -200 | 2022
2 | 3  ; 2  ;  1   | 3    | 2    | 1

The final requirement is to have the sums of the new columns sorted in one column.

  • is your end objective just to add the values? or would you need the new columns for other purposes as well? Commented Nov 8, 2022 at 13:54
  • is it a pandas or pyspark question ? cause that's not the same thing at all. Commented Nov 8, 2022 at 14:01

2 Answers


In PySpark, you can do it by first splitting your string on ; (creating an array) and then selecting columns using a list comprehension. The sum can be calculated using the aggregate higher-order function.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('1', '300;-200;2022'),
     ('2', '3  ; 2  ;   1')],
    ['a', 'a_vector']
)

Script:

# split on ';', stripping any surrounding whitespace (raw string avoids invalid-escape warnings)
col_arr = F.split('a_vector', r'\s*;\s*')
max_size = df.agg(F.max(F.size(col_arr))).head()[0]
df = df.select(
    '*',
    *[col_arr[i].alias(f'col{i}') for i in range(max_size)],
    F.aggregate(col_arr, F.lit(0.0), lambda acc, x: acc + x).alias('sum')
)

df.show()
# +---+-------------+----+----+----+------+
# |  a|     a_vector|col0|col1|col2|   sum|
# +---+-------------+----+----+----+------+
# |  1|300;-200;2022| 300|-200|2022|2122.0|
# |  2|3  ; 2  ;   1|   3|   2|   1|   6.0|
# +---+-------------+----+----+----+------+

Based on the comments, the following is probably what you need: calculate the max_size of the arrays, compute column-wise sums, and sort the results.

col_arr = F.split('a_vector', r'\s*;\s*')
max_size = df.agg(F.max(F.size(col_arr))).head()[0]
df = df.agg(F.array_sort(F.array([F.sum(col_arr[i]) for i in range(max_size)])).alias('sum'))

df.show(truncate=0)
# +-----------------------+
# |sum                    |
# +-----------------------+
# |[-198.0, 303.0, 2023.0]|
# +-----------------------+



In pandas, split the string and name the columns dynamically (the regex split strips whitespace around each ;):

import pandas as pd

dfs = df['a_vector'].str.split(r'\s*;\s*', expand=True, regex=True).rename(columns=lambda x: 'col' + str(x + 1))
df = pd.concat([df, dfs], axis=1)
print(df)

inputs

   a       a_vector
0  1  300;-200;2022
1  2          3;2;1

Output

   a       a_vector col1  col2  col3
0  1  300;-200;2022  300  -200  2022
1  2          3;2;1    3     2     1

Note: not tested. Let me know if it doesn't work.
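The row-wise sum that the question also asks for could be added along the same lines. A minimal sketch, assuming pandas is imported as pd and that the split values should be cast to float before summing (the `sum` column name is my choice, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'a_vector': ['300;-200;2022', '3  ; 2  ;   1']})

# split on ';' (stripping surrounding whitespace) into dynamically named columns
dfs = (df['a_vector'].str.split(r'\s*;\s*', expand=True, regex=True)
       .rename(columns=lambda x: 'col' + str(x + 1))
       .astype(float))
df = pd.concat([df, dfs], axis=1)

# sum only the generated col* columns, row-wise
df['sum'] = df.filter(regex=r'^col\d+$').sum(axis=1)
print(df)
```

Using `filter(regex=...)` keeps the selection dynamic, so it works for any number of generated columns.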

2 Comments

Thanks! After I have the second output, how could I sum col1, col2, col3 dynamically too? Something like newdf = df.sum(columns = lambda x: "col"+str(x+1))?
You mean you want to sum col1, col2, col3? Like 300 + (-200) + 2022 = 2122?
