
This is similar to Pyspark: cast array with nested struct to string

But the accepted answer does not work for my case, so I am asking here.

 |-- Col1: string (nullable = true)
 |-- Col2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Col2Sub: string (nullable = true)

Sample JSON

{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}

This gives the result in a single column:

import pyspark.sql.functions as F
df.selectExpr("EXPLODE(Col2) AS structCol").select(F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()
    +----------------+
    | Col2_concated  |
    +----------------+
    |foo,bar         |
    +----------------+

But how do I get a result or DataFrame like this?

+-------+---------------+
|Col1   | Col2_concated |
+-------+---------------+
|abc123 |foo,bar        |
+-------+---------------+

EDIT: The solution suggested in the comments gives the wrong result:

df.selectExpr("Col1","EXPLODE(Col2) AS structCol").select("Col1", F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show() 


+-------+---------------+
|Col1   | Col2_concated |
+-------+---------------+
|abc123 |foo            |
|abc123 |bar            |
+-------+---------------+
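
For completeness, re-aggregating after the explode does collapse these rows back together, but it needs an extra groupBy (a shuffle), and collect_list does not guarantee the original element order. A rough sketch of that idea:

import pyspark.sql.functions as F

# Sketch of a workaround: explode, then group back by Col1 and
# re-concatenate the exploded Col2Sub values into a single string.
# Note: collect_list gives no guarantee about element order.
(df.selectExpr("Col1", "EXPLODE(Col2) AS structCol")
   .groupBy("Col1")
   .agg(F.concat_ws(",", F.collect_list("structCol.Col2Sub")).alias("Col2_concated"))
   .show())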
  • You can select "Col1" as well, even though it is not an expression: df.selectExpr("Col1", "EXPLODE(Col2) AS structCol").select("Col1", F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show() Commented Mar 15, 2019 at 9:57
  • Sorry, this doesn't work; I added the reason to the question. Commented Mar 15, 2019 at 15:43

1 Answer


Just avoid the explode and you are already there. All you need is the concat_ws function, which concatenates multiple string columns with a given separator. See the example below:

from pyspark.sql import functions as F
j = '{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}'
df = spark.read.json(sc.parallelize([j]))

#printSchema tells us the column names we can use with concat_ws                                                                              
df.printSchema()

Output:

root
 |-- Col1: string (nullable = true)
 |-- Col2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Col2Sub: string (nullable = true)

The column Col2 is an array of structs with a Col2Sub field, so Col2.Col2Sub gives an array of strings that concat_ws can join directly:

bla = df.withColumn('Col2', F.concat_ws(',', df.Col2.Col2Sub))

bla.show()
+------+-------+                                                                
|  Col1|   Col2|
+------+-------+
|abc123|foo,bar|
+------+-------+
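
If you are on Spark 2.4 or later, array_join should give the same result on the extracted array; just an alternative sketch of the same idea:

from pyspark.sql import functions as F

# Alternative (Spark 2.4+): array_join concatenates the elements of an
# array<string> column; Col2.Col2Sub pulls the Col2Sub field out of every
# struct in the array.
alt = df.withColumn('Col2', F.array_join(df.Col2.Col2Sub, ','))
alt.show()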

1 Comment

Thanks! You are a lifesaver :)
