
Is it possible to extract all of the rows of a specific column to a container of type array?

I want to be able to extract it and then reshape it as an array. The column I am trying to extract is of type UDT (user-defined type).

I tried to use

my_array =  df.select(df['my_col'])

but this is not correct: select() returns a single-column DataFrame, not an array.

  • To clarify, you are not looking for a Python list, but something like pyspark.sql.types.ArrayType? Commented Nov 2, 2021 at 20:46
  • Sorry for the confusion. Yes, you are correct. I need to be able to reshape() it so that I can pass it into a function. Commented Nov 2, 2021 at 20:49

1 Answer


collect_list() gives you an array of values.

A. If you want to collect all the values of one column, say c2, grouped by another column, say c1, group by c1 and aggregate c2 with collect_list:

df = spark.createDataFrame([
    ('emma', 'math'),
    ('emma', 'english'),
    ('mia', 'english'),
    ('mia', 'science'),
    ('mona', 'math'),
    ('mona', 'geography')
], ["student", "subject"])

from pyspark.sql.functions import collect_list

df1 = df.groupBy('student').agg(collect_list('subject'))
df1.show()
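To illustrate the semantics, here is a sketch in plain Python of what the grouped result contains (note that Spark does not guarantee the order of elements within each collected list, so the order below is only illustrative):

```python
from collections import defaultdict

# The same (student, subject) rows as in the DataFrame above
rows = [
    ('emma', 'math'),
    ('emma', 'english'),
    ('mia', 'english'),
    ('mia', 'science'),
    ('mona', 'math'),
    ('mona', 'geography'),
]

# Group subjects per student, mirroring groupBy + collect_list
grouped = defaultdict(list)
for student, subject in rows:
    grouped[student].append(subject)

print(dict(grouped))
```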

B. If you want all values of c2 irrespective of any other column, you can group by a literal:

from pyspark.sql.functions import lit

df1 = df.groupBy(lit(1)).agg(collect_list('subject'))
df1.show()
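Either way, the collected array still lives inside a DataFrame on the executors; to reshape it you have to bring it to the driver first (for example with df1.collect()) and hand the resulting Python list to NumPy. A minimal sketch, assuming the driver-side list looks like the hypothetical values below (in the single-group case B it would come from df1.collect()[0][0]):

```python
import numpy as np

# Hypothetical driver-side result of collecting the array column,
# e.g. values = df1.collect()[0][0] in the single-group case
values = ['math', 'english', 'english', 'science', 'math', 'geography']

# Once it is a plain Python list, NumPy can reshape it freely
arr = np.array(values).reshape(3, 2)
print(arr.shape)
```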

