
I have a table like this:

a | a_vector
1 | 710.83;-776.98;-10.86;2013.02;-896.28;
2 | 3  ; 2  ;  1

Using PySpark/pandas, how do I dynamically create columns so that the first value in the vector goes to "col1", the second to "col2", etc., and also calculate the sum?

a | a_vector       | col1 | col2 | col3
1 | 300;-200;2022; | 300  | -200 | 2022
2 | 3  ; 2  ;  1   | 3    | 2    | 1

The final requirement is to have the sums of the new columns sorted in one column.

  • is your end objective just to add the values? or would you need the new columns for other purposes as well? Commented Nov 8, 2022 at 13:54
  • is it a pandas or pyspark question ? cause that's not the same thing at all. Commented Nov 8, 2022 at 14:01

2 Answers


In PySpark, you can do it by first splitting your string on ; (creating an array) and then selecting columns using a list comprehension. The sum can be calculated using the aggregate higher-order function.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('1', '300;-200;2022'),
     ('2', '3  ; 2  ;   1')],
    ['a', 'a_vector']
)

Script:

# split on ';', stripping any surrounding whitespace (raw string avoids invalid-escape warnings)
col_arr = F.split('a_vector', r'\s*;\s*')
max_size = df.agg(F.max(F.size(col_arr))).head()[0]
df = df.select(
    '*',
    *[col_arr[i].alias(f'col{i}') for i in range(max_size)],
    F.aggregate(col_arr, F.lit(0.0), lambda acc, x: acc + x).alias('sum')
)

df.show()
# +---+-------------+----+----+----+------+
# |  a|     a_vector|col0|col1|col2|   sum|
# +---+-------------+----+----+----+------+
# |  1|300;-200;2022| 300|-200|2022|2122.0|
# |  2|3  ; 2  ;   1|   3|   2|   1|   6.0|
# +---+-------------+----+----+----+------+

Based on the comments, the following is probably what you need: calculate the max_size of the arrays, compute column-wise sums, and sort the results.

col_arr = F.split('a_vector', r'\s*;\s*')
max_size = df.agg(F.max(F.size(col_arr))).head()[0]
df = df.agg(F.array_sort(F.array([F.sum(col_arr[i]) for i in range(max_size)])).alias('sum'))

df.show(truncate=0)
# +-----------------------+
# |sum                    |
# +-----------------------+
# |[-198.0, 303.0, 2023.0]|
# +-----------------------+



In pandas, split the string and name the columns dynamically (the regex split strips whitespace around each ;):

import pandas as pd

dfs = df['a_vector'].str.split(r'\s*;\s*', expand=True, regex=True).rename(columns=lambda x: 'col' + str(x + 1))
df = pd.concat([df, dfs], axis=1)
print(df)

inputs

   a       a_vector
0  1  300;-200;2022
1  2          3;2;1

Output

   a       a_vector col1  col2  col3
0  1  300;-200;2022  300  -200  2022
1  2          3;2;1    3     2     1

Note: not tested. Let me know if it doesn't work.
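The row-wise sum that the question also asks for could be added along the same lines. A minimal sketch, assuming pandas is imported as pd and that the split values should be cast to float before summing (the `sum` column name is my choice, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'a_vector': ['300;-200;2022', '3  ; 2  ;   1']})

# split on ';' (stripping surrounding whitespace) into dynamically named columns
dfs = (df['a_vector'].str.split(r'\s*;\s*', expand=True, regex=True)
       .rename(columns=lambda x: 'col' + str(x + 1))
       .astype(float))
df = pd.concat([df, dfs], axis=1)

# sum only the generated col* columns, row-wise
df['sum'] = df.filter(regex=r'^col\d+$').sum(axis=1)
print(df)
```

Using `filter(regex=...)` keeps the selection dynamic, so it works for any number of generated columns.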

2 Comments

Thanks! After I have the second output, how could I sum col1, col2, col3 dynamically too? Something like newdf = df.sum(columns = lambda x: "col"+str(x+1))?
You mean you want to sum col1, col2, col3? Like 300 + (-200) + 2022 = 2122?
