I want to create an array column from existing columns in a PySpark DataFrame.

--------------------------
col0 | col1 | col2 | col3
--------------------------
1    |a     |b     |c
--------------------------
2    |d     |e     |f
--------------------------

I want the result to look like this:

-------------
col0 | col1 
-------------
1    |[a,b,c]
-------------
2    |[d,e,f]
-------------

I was trying the array() function like this:

>>> new = df.select("col0",array("col1","col2","col3").alias("col1"))

but getting this error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'list' object is not callable

Does anyone have a solution for this?
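For what it's worth, `TypeError: 'list' object is not callable` usually means the name `array` no longer refers to `pyspark.sql.functions.array` but has been rebound to a plain Python list somewhere in the session. A minimal sketch of that shadowing in plain Python (no Spark needed; the stray assignment is hypothetical):

```python
# Stand-in for "from pyspark.sql.functions import array"
def array(*cols):
    return list(cols)

# A stray assignment in the REPL rebinds the name to a list...
array = ["col1", "col2", "col3"]

# ...so the next call raises the exact error from the traceback.
try:
    array("col1", "col2", "col3")
except TypeError as e:
    print(e)  # 'list' object is not callable
```

Importing through a module alias, e.g. `from pyspark.sql import functions as F` and calling `F.array(...)`, avoids this kind of name collision entirely.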

  • It worked after I restarted my pyspark. Commented Nov 12, 2020 at 6:10

1 Answer

You need to use withColumn() first to create the new column; after that you can use select() to choose the columns you want:

from pyspark.sql.functions import array

# build the array column, then keep only col0 and the new array column
df = df.withColumn("col1", array("col1", "col2", "col3"))
df = df.select("col0", "col1")

and you are getting this error because you are using the .alias() function, which the interpreter is complaining about
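Conceptually, array() just packs the listed column values into one array per row, so the question's original `df.select("col0", array("col1","col2","col3").alias("col1"))` should work in a fresh session. A plain-Python sketch of the same per-row transformation, on hypothetical data so no Spark is required:

```python
# Row-by-row equivalent of
#   df.select("col0", F.array("col1", "col2", "col3").alias("col1"))
rows = [(1, "a", "b", "c"), (2, "d", "e", "f")]

# collect col1..col3 into a list per row, keeping col0 alongside
result = [(col0, [c1, c2, c3]) for col0, c1, c2, c3 in rows]
print(result)  # [(1, ['a', 'b', 'c']), (2, ['d', 'e', 'f'])]
```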

2 Comments

alias should work on array because array returns a column. I suspect array was not pyspark.sql.functions.array but something else, but after restarting spark as in the answer below, it somehow got replaced by the correct spark array function.
Don't know.. I also tried list() instead of array(), and it gave me the same error
