Let's say I have a numpy array a that contains the numbers 1-10:
[1 2 3 4 5 6 7 8 9 10]

I also have a Spark DataFrame to which I want to add my numpy array a. I figure that a column of literals will do the job. This doesn't work:

import pyspark.sql.functions as F

df = df.withColumn("NewColumn", F.lit(a))

It fails with:

Unsupported literal type class java.util.ArrayList

But this works:

df = df.withColumn("NewColumn", F.lit(a[0]))

How can I do this?

Example DF before:

+--------------------+
|col1                |
+--------------------+
|a b c d e f g h i j |
+--------------------+

Expected result:

+--------------------+--------------------+
|col1                |NewColumn           |
+--------------------+--------------------+
|a b c d e f g h i j |1 2 3 4 5 6 7 8 9 10|
+--------------------+--------------------+

2 Answers


List comprehension inside Spark's array

import pyspark.sql.functions as F

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df = spark.createDataFrame([['a b c d e f g h i j '],], ['col1'])
df = df.withColumn("NewColumn", F.array([F.lit(x) for x in a]))

df.show(truncate=False)
df.printSchema()
#  +--------------------+-------------------------------+
#  |col1                |NewColumn                      |
#  +--------------------+-------------------------------+
#  |a b c d e f g h i j |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
#  +--------------------+-------------------------------+
#  root
#   |-- col1: string (nullable = true)
#   |-- NewColumn: array (nullable = false)
#   |    |-- element: integer (containsNull = false)

@pault commented (Python 2.7):

You can hide the loop using map:
df.withColumn("NewColumn", F.array(map(F.lit, a)))

@abegehr added the Python 3 version:

df.withColumn("NewColumn", F.array(*map(F.lit, a)))

Spark's udf

# Defining the UDF (returns the list a defined above)
import pyspark.sql.types as T

def arrayUdf():
    return a

callArrayUdf = F.udf(arrayUdf, T.ArrayType(T.IntegerType()))

# Calling the UDF
df = df.withColumn("NewColumn", callArrayUdf())

Output is the same.
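If you'd rather not have the udf depend on the global a, the same idea can be written as a closure. A small sketch, assuming the imports above; the make_array_udf helper name is my own:

# Hypothetical helper: builds a zero-argument udf that returns the captured list
def make_array_udf(values):
    return F.udf(lambda: values, T.ArrayType(T.IntegerType()))

df = df.withColumn("NewColumn", make_array_udf([int(x) for x in a])())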


7 Comments

I tried this and it works. Thank you for the answer and I will keep it this way for now. However, in reality, my "a" array has tens of thousands of entries, and because of the for loop, it is not quite efficient. Is there a way to do it without loops?
@A.R. I have updated my answer with a udf function, which doesn't require a for loop. If the answer is helpful you can accept it and upvote.
You can hide the loop using map: df.withColumn("NewColumn", F.array(map(F.lit, a)))
@pault Isn't map an RDD function? Also, the output of map is neither a string nor a Column, so withColumn would throw an error.
@pault, I think this should be F.array(*map(F.lit, a)) with the * (star) unpacking operator, since F.array cannot handle a map object.

In the Scala API, we can use the typedLit function to add Array or Map values to a column.

// Ref : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

Here is the sample code to add an Array or Map as a column value.

import org.apache.spark.sql.functions.typedLit
import spark.implicits._ // for toDF; already in scope in spark-shell

val df1 = Seq((1, 0), (2, 3)).toDF("a", "b")

df1.withColumn("seq", typedLit(Seq(1, 2, 3)))
    .withColumn("map", typedLit(Map(1 -> 2)))
    .show(truncate = false)

// Output

+---+---+---------+--------+
|a  |b  |seq      |map     |
+---+---+---------+--------+
|1  |0  |[1, 2, 3]|[1 -> 2]|
|2  |3  |[1, 2, 3]|[1 -> 2]|
+---+---+---------+--------+

I hope this helps.
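As a usage note, typedLit also accepts an explicit type parameter when you want to control the resulting schema type. A small sketch under the same imports as above:

// Explicit type parameter: the column becomes array<bigint> instead of array<int>
df1.withColumn("seqLong", typedLit[Seq[Long]](Seq(1L, 2L, 3L))).printSchema()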

1 Comment

This doesn't answer the question; the OP asked for a PySpark solution.
