PySpark - Convert Array Struct to Column Name the my Struct

Question

I'm trying to convert my struct array by columns of my array, where I would like to separate the fields by ;

My data is this way

root
 |-- id_payment: string (nullable = true)
 |-- contents_json: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- id_seller: string (nullable = true)
 |    |    |-- tot_product: string (nullable = true)
 |    |    |-- tot_unit: long (nullable = true)
 |    |    |-- prizes: string (nullable = true)

+--------------------------------------------+
|contents_json                               |
+--------------------------------------------+
|[{7, 1, 0, 0}, {5, 1, 0, 0}, {5, 1, 0, 0}]  |
|[{7, 1, 0, 0}]                              |
+--------------------------------------------+

I would like to leave it like this

+-----------------------------------------------+
| id_seller| tot_product | tot_unit  | prizes   |
+-----------------------------------------------+
| 7;5;7    | 1;12;5      | 0;1;0     | 0;0;1    |
| 10       | 1           | 0         | 0        |
+-----------------------------------------------+

But I couldn't do this with ws_concat

I tried with this code, but it didn't work. What am I doing wrong

I tried this code, but instead of concatenating it broke into two lines. What I'm trying to do is to concatenate in the same cell these lines val1;val2;val3 and so on. How can I fix this?

sparkdf2.selectExpr("explode(contents_json) AS structCol").select(F.expr("concat_ws(',', structCol.id_campaign)").alias("id_campaign")).show()

where are all these numbers coming from ? contents_json and expected ouput do not seem to contain the same data. how do you go from one to another ? please explain mapping logic — Steven
– Steven, Commented Oct 20, 2021 at 13:09

Steven · Accepted Answer · 2021-10-20 13:18:19Z

well, if you just select the columns, isn't it enough ?

df.show()
+----------+------------------------------------------+
|id_payment|contents_json                             |
+----------+------------------------------------------+
|1         |[[7, 1, 0, 0], [5, 1, 0, 0], [5, 1, 0, 0]]|
|2         |[[7, 1, 0, 0]]                            |
+----------+------------------------------------------+

df.select(
    "id_payment",
    "contents_json.id_seller",
    "contents_json.tot_product",
    "contents_json.tot_unit",
    "contents_json.prizes",
).show()
+----------+---------+-----------+---------+---------+
|id_payment|id_seller|tot_product| tot_unit|   prizes|
+----------+---------+-----------+---------+---------+
|         1|[7, 5, 5]|  [1, 1, 1]|[0, 0, 0]|[0, 0, 0]|
|         2|      [7]|        [1]|      [0]|      [0]|
+----------+---------+-----------+---------+---------+

You can use concat_ws if needed :

df.select(
    "id_payment",
    F.concat_ws(";","contents_json.id_seller").alias("id_seller"),
    "contents_json.tot_product",
    "contents_json.tot_unit",
    "contents_json.prizes",
).show()
+----------+---------+-----------+---------+---------+
|id_payment|id_seller|tot_product| tot_unit|   prizes|
+----------+---------+-----------+---------+---------+
|         1|    7;5;5|  [1, 1, 1]|[0, 0, 0]|[0, 0, 0]|
|         2|        7|        [1]|      [0]|      [0]|
+----------+---------+-----------+---------+---------+

Collectives™ on Stack Overflow

PySpark - Convert Array Struct to Column Name the my Struct

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related