I have the following dataframe in Pyspark

+----+-------+-----+                                                            
|name|subject|score|
+----+-------+-----+
| Tom|   math|   90|
| Tom|physics|   70|
| Amy|   math|   95|
+----+-------+-----+

I used the collect_list and struct functions from pyspark.sql.functions

df.groupBy('name').agg(collect_list(struct('subject', 'score')).alias('score_list'))

to get the following dataframe

+----+--------------------+
|name|          score_list|
+----+--------------------+
| Tom|[[math, 90], [phy...|
| Amy|        [[math, 95]]|
+----+--------------------+

My question is: how can I transform the last column score_list into a string and dump it into a csv file that looks like

Tom     (math, 90) | (physics, 70)
Amy     (math, 95)

Any help is appreciated, thanks.

Update: Here is a similar question, but it's not exactly the same because it goes directly from one string to another string. In my case, I want to first transform the strings into a collect_list<struct> and finally stringify that collect_list<struct>.

2 Answers

Per your update and comment, for Spark 2.4.0+, here is one way to stringify an array of structs with the Spark SQL builtin functions transform and array_join:

>>> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- score_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- subject: string (nullable = true)
 |    |    |-- score: integer (nullable = true)

>>> df.show(2,0)
+----+---------------------------+
|name|score_list                 |
+----+---------------------------+
|Tom |[[math, 90], [physics, 70]]|
|Amy |[[math, 95]]               |
+----+---------------------------+

>>> df.selectExpr(
        "name"
      , """
         array_join(
             transform(score_list, x -> concat('(', x.subject, ', ', x.score, ')'))
           , ' | '
         ) AS score_list
        """
).show(2,0)

+----+--------------------------+
|name|score_list                |
+----+--------------------------+
|Tom |(math, 90) | (physics, 70)|
|Amy |(math, 95)                |
+----+--------------------------+

Where:

  1. Use transform() to convert the array of structs into an array of strings: for each array element (the struct x), concat('(', x.subject, ', ', x.score, ')') converts it into a string.
  2. Use array_join() to join all array elements (StringType) with ' | '; this returns the final string.
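The two steps can be illustrated in plain Python, without Spark; this is just a sketch of the logic, using sample data mirroring one row's score_list:

```python
# Sample data mirroring Tom's score_list array of structs (assumed from the question)
score_list = [("math", 90), ("physics", 70)]

# Step 1: transform(score_list, x -> concat('(', x.subject, ', ', x.score, ')'))
strings = [f"({subject}, {score})" for subject, score in score_list]

# Step 2: array_join(..., ' | ')
result = " | ".join(strings)
print(result)  # (math, 90) | (physics, 70)
```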
4

The duplicates I linked don't exactly answer your question, since you're combining multiple columns. Nevertheless, you can modify those solutions to fit your desired output quite easily.

Just replace the struct with concat_ws. Also use concat to add the opening and closing parentheses to get the output you desire.

from pyspark.sql.functions import collect_list, concat, concat_ws, lit

df = df.groupBy('name')\
    .agg(
        concat_ws(
            " | ", 
            collect_list(
                concat(lit("("), concat_ws(", ", 'subject', 'score'), lit(")"))
            )
        ).alias('score_list')
    )
df.show(truncate=False)

#+----+--------------------------+
#|name|score_list                |
#+----+--------------------------+
#|Tom |(math, 90) | (physics, 70)|
#|Amy |(math, 95)                |
#+----+--------------------------+
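For reference, the same per-row string building and grouping can be sketched in plain Python; the rows are assumed from the question's dataframe, and this is only an illustration of the logic, not Spark code:

```python
from collections import defaultdict

# Rows assumed from the question's dataframe
rows = [("Tom", "math", 90), ("Tom", "physics", 70), ("Amy", "math", 95)]

grouped = defaultdict(list)
for name, subject, score in rows:
    # concat(lit("("), concat_ws(", ", 'subject', 'score'), lit(")"))
    grouped[name].append(f"({subject}, {score})")

# concat_ws(" | ", collect_list(...)) per group
score_list = {name: " | ".join(parts) for name, parts in grouped.items()}
print(score_list["Tom"])  # (math, 90) | (physics, 70)
print(score_list["Amy"])  # (math, 95)
```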

Note that since the comma appears in the score_list column, this value will be quoted when you write to csv if you use the default arguments.

For example:

df.coalesce(1).write.csv("test.csv")

Would produce the following output file:

Tom,"(math, 90) | (physics, 70)"
Amy,"(math, 95)"
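This quoting behaviour can be reproduced with Python's built-in csv module, which, like Spark's default csv writer, quotes any field containing the delimiter (a standalone illustration, not Spark output):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default QUOTE_MINIMAL: quote fields containing ","
writer.writerow(["Tom", "(math, 90) | (physics, 70)"])
writer.writerow(["Amy", "(math, 95)"])
print(buf.getvalue())
# Tom,"(math, 90) | (physics, 70)"
# Amy,"(math, 95)"
```

If you'd rather avoid the quotes, one option is to write with a delimiter that doesn't appear in the values, e.g. df.write.csv("test.csv", sep="\t").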
