
The code below works in Scala Spark:

scala> val ar = Array("oracle", "java")
ar: Array[String] = Array(oracle, java)

scala> df.withColumn("tags", lit(ar)).show(false)
+------+---+----------+----------+--------------+
|name  |age|role      |experience|tags          |
+------+---+----------+----------+--------------+
|John  |25 |Developer |2.56      |[oracle, java]|
|Scott |30 |Tester    |5.2       |[oracle, java]|
|Jim   |28 |DBA       |3.0       |[oracle, java]|
|Mike  |35 |Consultant|10.0      |[oracle, java]|
|Daniel|26 |Developer |3.2       |[oracle, java]|
|Paul  |29 |Tester    |3.6       |[oracle, java]|
|Peter |30 |Developer |6.5       |[oracle, java]|
+------+---+----------+----------+--------------+

How do I get the same behavior in PySpark? I tried the code below, but it doesn't work and throws a Java error.

from pyspark.sql.functions import lit

tag = ["oracle", "java"]
df2.withColumn("tags", lit(tag)).show()

: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [oracle, java]


5 Answers


You can import the array function from the functions module:

>>> from pyspark.sql.functions import array, lit

>>> tag = array(lit("oracle"), lit("java"))
>>> df2.withColumn("tags", tag).show()

Tested below

>>> from pyspark.sql.functions import array

>>> tag=array(lit("oracle"),lit("java"))
>>> 
>>> ranked.withColumn("tag",tag).show()
+------+--------------+----------+-----+----+----+--------------+               
|gender|    ethinicity|first_name|count|rank|year|           tag|
+------+--------------+----------+-----+----+----+--------------+
|  MALE|      HISPANIC|    JAYDEN|  364|   1|2012|[oracle, java]|
|  MALE|WHITE NON HISP|    JOSEPH|  300|   2|2012|[oracle, java]|
|  MALE|WHITE NON HISP|    JOSEPH|  300|   2|2012|[oracle, java]|
|  MALE|      HISPANIC|     JACOB|  293|   4|2012|[oracle, java]|
|  MALE|      HISPANIC|     JACOB|  293|   4|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     DAVID|  289|   6|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     DAVID|  289|   6|2012|[oracle, java]|
|  MALE|      HISPANIC|   MATTHEW|  279|   8|2012|[oracle, java]|
|  MALE|      HISPANIC|   MATTHEW|  279|   8|2012|[oracle, java]|
|  MALE|      HISPANIC|     ETHAN|  254|  10|2012|[oracle, java]|
|  MALE|      HISPANIC|     ETHAN|  254|  10|2012|[oracle, java]|
|  MALE|WHITE NON HISP|   MICHAEL|  245|  12|2012|[oracle, java]|
|  MALE|WHITE NON HISP|   MICHAEL|  245|  12|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     JACOB|  242|  14|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     JACOB|  242|  14|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     MOSHE|  238|  16|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     MOSHE|  238|  16|2012|[oracle, java]|
|  MALE|      HISPANIC|     ANGEL|  236|  18|2012|[oracle, java]|
|  MALE|      HISPANIC|     AIDEN|  235|  19|2012|[oracle, java]|
|  MALE|WHITE NON HISP|    DANIEL|  232|  20|2012|[oracle, java]|
+------+--------------+----------+-----+----+----+--------------+
only showing top 20 rows

3 Comments

Just curious: how do you assign a Python dictionary in the same way?
Try the create_map() function.
tag2 = lit("{'a':1,'b':2}"); ranked.withColumn("tag", tag2).show()
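Note that lit on a string like the above produces a plain string column, not a map. A minimal sketch of the create_map() approach suggested in the comment (the dict d, the flattening step, and the DataFrame df are assumptions for illustration, not from the thread):

```python
from itertools import chain

# Flatten {'a': 1, 'b': 2} into alternating [key, value, key, value, ...],
# which is the argument order create_map() expects.
d = {'a': 1, 'b': 2}
kv_pairs = list(chain.from_iterable(d.items()))
print(kv_pairs)  # ['a', 1, 'b', 2]

# With an active SparkSession and a DataFrame df (hypothetical here):
# from pyspark.sql import functions as F
# df = df.withColumn("tag", F.create_map(*[F.lit(x) for x in kv_pairs]))
```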

I found that the list comprehension below works:

>>> from pyspark.sql.functions import array, lit

>>> arr = ["oracle", "java"]
>>> mp = [lit(x) for x in arr]
>>> df.withColumn("mk", array(mp)).show()
+------+---+----------+----------+--------------+
|  name|age|      role|experience|            mk|
+------+---+----------+----------+--------------+
|  John| 25| Developer|      2.56|[oracle, java]|
| Scott| 30|    Tester|       5.2|[oracle, java]|
|   Jim| 28|       DBA|       3.0|[oracle, java]|
|  Mike| 35|Consultant|      10.0|[oracle, java]|
|Daniel| 26| Developer|       3.2|[oracle, java]|
|  Paul| 29|    Tester|       3.6|[oracle, java]|
| Peter| 30| Developer|       6.5|[oracle, java]|
+------+---+----------+----------+--------------+




There is a difference between ar declared in Scala and tag declared in Python: ar is an Array, while tag is a Python list, and lit does not accept a list, which is why it raises the error.

You need numpy installed to declare an array, as below:

import numpy as np
tag = np.array(("oracle","java"))

Just for reference: if you use a List in Scala, it also gives an error.

scala> val ar = List("oracle","java")
ar: List[String] = List(oracle, java)

scala> df.withColumn("newcol", lit(ar)).printSchema
java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon List(oracle, java)
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at scala.util.Try.getOrElse(Try.scala:79)
  at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163)
  at org.apache.spark.sql.functions$.typedLit(functions.scala:127)
  at org.apache.spark.sql.functions$.lit(functions.scala:110)

2 Comments

No, it is giving an error. tag = ["oracle","java"]; tag2 = np.array(tag) works, but df.withColumn("tag", lit(tag2)) again throws an error.
Why are you using tag2 = np.array(tag)? You should use tag = np.array(("oracle","java")) as I mentioned.

Spark 3.4+

F.lit(["oracle", "java"])

Full example:

from pyspark.sql import functions as F

df = spark.range(5)
df = df.withColumn("tags", F.lit(["oracle", "java"]))

df.show()
# +---+--------------+
# | id|          tags|
# +---+--------------+
# |  0|[oracle, java]|
# |  1|[oracle, java]|
# |  2|[oracle, java]|
# |  3|[oracle, java]|
# |  4|[oracle, java]|
# +---+--------------+
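On Spark versions before 3.4, F.lit raises the "Unsupported literal type" error on a Python list, so the array column has to be built one literal per element. A minimal sketch (df and the column name are assumptions; the Spark calls are shown as comments since they need an active SparkSession):

```python
# Pre-3.4 workaround: turn each list element into its own literal column,
# then combine them with array().
tags = ["oracle", "java"]

# With an active SparkSession (not started here), this would be:
# from pyspark.sql import functions as F
# df = df.withColumn("tags", F.array(*[F.lit(t) for t in tags]))
```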



The answers above are not general, as they hard-code an array of literals. In a typical application you start with a plain Python list, and you cannot append that list to a PySpark DataFrame directly. Instead, iterate over the list, convert each item to a literal, and pass the resulting group of literals to PySpark's array function; that array can then be added as a new column to the DataFrame. Hope the code below helps:

from pyspark.sql.types import *
import pyspark.sql.functions as F
import numpy as np

arr1 = np.array(["oracle", "java"])
print(arr1)
df_test = spark.createDataFrame(data=[('finance', 10), ('marketing', 20), ('sales', 30), ('IT', 40)], schema=['deptname', 'deptid'])
df_test = df_test.withColumn("tags", F.array([F.lit(item) for item in arr1]))
df_test.show(truncate=False)

2 Comments

As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.
I updated the answer as requested. can you please validate it now.
