
Is it possible to extract all of the rows of a specific column to a container of type array?

I want to be able to extract it and then reshape it as an array. The column I am trying to extract is of type UDT (user-defined type).

I tried to use

my_array =  df.select(df['my_col'])

but this is not correct: select() returns a single-column DataFrame, not an array.

  • To clarify, you are not looking for a Python list, but something like pyspark.sql.types.ArrayType? Commented Nov 2, 2021 at 20:46
  • Sorry for the confusion. Yes, you are correct. I need to be able to reshape() it so that I can pass it into a function. Commented Nov 2, 2021 at 20:49

1 Answer


collect_list() gives you an array of values.

A. If you want to collect all the values of one column, say c2, grouped by another column, say c1, group by c1 and aggregate c2 with collect_list:

df = spark.createDataFrame([
    ('emma', 'math'),
    ('emma', 'english'),
    ('mia', 'english'),
    ('mia', 'science'),
    ('mona', 'math'),
    ('mona', 'geography')
], ["student", "subject"])

from pyspark.sql.functions import collect_list

df1 = df.groupBy('student').agg(collect_list('subject'))
df1.show()
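To illustrate the semantics, here is a sketch in plain Python of what the grouped result contains (note that Spark does not guarantee the order of elements within each collected list, so the order below is only illustrative):

```python
from collections import defaultdict

# The same (student, subject) rows as in the DataFrame above
rows = [
    ('emma', 'math'),
    ('emma', 'english'),
    ('mia', 'english'),
    ('mia', 'science'),
    ('mona', 'math'),
    ('mona', 'geography'),
]

# Group subjects per student, mirroring groupBy + collect_list
grouped = defaultdict(list)
for student, subject in rows:
    grouped[student].append(subject)

print(dict(grouped))
```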

B. If you want all values of c2 irrespective of any other column, you can group by a literal:

from pyspark.sql.functions import lit

df1 = df.groupBy(lit(1)).agg(collect_list('subject'))
df1.show()
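Either way, the collected array still lives inside a DataFrame on the executors; to reshape it you have to bring it to the driver first (for example with df1.collect()) and hand the resulting Python list to NumPy. A minimal sketch, assuming the driver-side list looks like the hypothetical values below (in the single-group case B it would come from df1.collect()[0][0]):

```python
import numpy as np

# Hypothetical driver-side result of collecting the array column,
# e.g. values = df1.collect()[0][0] in the single-group case
values = ['math', 'english', 'english', 'science', 'math', 'geography']

# Once it is a plain Python list, NumPy can reshape it freely
arr = np.array(values).reshape(3, 2)
print(arr.shape)
```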

