
The code below works in Scala Spark:

scala> val ar = Array("oracle", "java")
ar: Array[String] = Array(oracle, java)

scala> df.withColumn("tags", lit(ar)).show(false)
+------+---+----------+----------+--------------+
|name  |age|role      |experience|tags          |
+------+---+----------+----------+--------------+
|John  |25 |Developer |2.56      |[oracle, java]|
|Scott |30 |Tester    |5.2       |[oracle, java]|
|Jim   |28 |DBA       |3.0       |[oracle, java]|
|Mike  |35 |Consultant|10.0      |[oracle, java]|
|Daniel|26 |Developer |3.2       |[oracle, java]|
|Paul  |29 |Tester    |3.6       |[oracle, java]|
|Peter |30 |Developer |6.5       |[oracle, java]|
+------+---+----------+----------+--------------+

How do I get the same behavior in PySpark? I tried the code below, but it doesn't work and throws a Java error.

from pyspark.sql.functions import lit

tag = ["oracle", "java"]
df2.withColumn("tags", lit(tag)).show()

: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [oracle, java]


5 Answers


You can import the array function from the functions module:

>>> from pyspark.sql.functions import array, lit

>>> tag = array(lit("oracle"), lit("java"))
>>> df2.withColumn("tags", tag).show()

Tested below

>>> from pyspark.sql.functions import array

>>> tag=array(lit("oracle"),lit("java"))
>>> 
>>> ranked.withColumn("tag",tag).show()
+------+--------------+----------+-----+----+----+--------------+               
|gender|    ethinicity|first_name|count|rank|year|           tag|
+------+--------------+----------+-----+----+----+--------------+
|  MALE|      HISPANIC|    JAYDEN|  364|   1|2012|[oracle, java]|
|  MALE|WHITE NON HISP|    JOSEPH|  300|   2|2012|[oracle, java]|
|  MALE|WHITE NON HISP|    JOSEPH|  300|   2|2012|[oracle, java]|
|  MALE|      HISPANIC|     JACOB|  293|   4|2012|[oracle, java]|
|  MALE|      HISPANIC|     JACOB|  293|   4|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     DAVID|  289|   6|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     DAVID|  289|   6|2012|[oracle, java]|
|  MALE|      HISPANIC|   MATTHEW|  279|   8|2012|[oracle, java]|
|  MALE|      HISPANIC|   MATTHEW|  279|   8|2012|[oracle, java]|
|  MALE|      HISPANIC|     ETHAN|  254|  10|2012|[oracle, java]|
|  MALE|      HISPANIC|     ETHAN|  254|  10|2012|[oracle, java]|
|  MALE|WHITE NON HISP|   MICHAEL|  245|  12|2012|[oracle, java]|
|  MALE|WHITE NON HISP|   MICHAEL|  245|  12|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     JACOB|  242|  14|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     JACOB|  242|  14|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     MOSHE|  238|  16|2012|[oracle, java]|
|  MALE|WHITE NON HISP|     MOSHE|  238|  16|2012|[oracle, java]|
|  MALE|      HISPANIC|     ANGEL|  236|  18|2012|[oracle, java]|
|  MALE|      HISPANIC|     AIDEN|  235|  19|2012|[oracle, java]|
|  MALE|WHITE NON HISP|    DANIEL|  232|  20|2012|[oracle, java]|
+------+--------------+----------+-----+----+----+--------------+
only showing top 20 rows

3 Comments

Just curious: how do you assign a Python dictionary in the same way?
Try the create_map() function.
tag2 = lit("{'a':1,'b':2}"); ranked.withColumn("tag", tag2).show()
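Note that lit on a string like the above produces a plain string column, not a map. A minimal sketch of the create_map() approach suggested in the comment (the dict d, the flattening step, and the DataFrame df are assumptions for illustration, not from the thread):

```python
from itertools import chain

# Flatten {'a': 1, 'b': 2} into alternating [key, value, key, value, ...],
# which is the argument order create_map() expects.
d = {'a': 1, 'b': 2}
kv_pairs = list(chain.from_iterable(d.items()))
print(kv_pairs)  # ['a', 1, 'b', 2]

# With an active SparkSession and a DataFrame df (hypothetical here):
# from pyspark.sql import functions as F
# df = df.withColumn("tag", F.create_map(*[F.lit(x) for x in kv_pairs]))
```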

I found that the list comprehension below works:

>>> from pyspark.sql.functions import array, lit

>>> arr = ["oracle", "java"]
>>> mp = [lit(x) for x in arr]
>>> df.withColumn("mk", array(mp)).show()
+------+---+----------+----------+--------------+
|  name|age|      role|experience|            mk|
+------+---+----------+----------+--------------+
|  John| 25| Developer|      2.56|[oracle, java]|
| Scott| 30|    Tester|       5.2|[oracle, java]|
|   Jim| 28|       DBA|       3.0|[oracle, java]|
|  Mike| 35|Consultant|      10.0|[oracle, java]|
|Daniel| 26| Developer|       3.2|[oracle, java]|
|  Paul| 29|    Tester|       3.6|[oracle, java]|
| Peter| 30| Developer|       6.5|[oracle, java]|
+------+---+----------+----------+--------------+




There is a difference between ar declared in Scala and tag declared in Python: ar is an Array, while tag is a Python list, and lit does not accept a list, which is why it raises the error.

You need numpy installed to declare an array, as below:

import numpy as np
tag = np.array(("oracle","java"))

Just for reference: if you use a List in Scala, it also gives an error.

scala> val ar = List("oracle","java")
ar: List[String] = List(oracle, java)

scala> df.withColumn("newcol", lit(ar)).printSchema
java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon List(oracle, java)
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at scala.util.Try.getOrElse(Try.scala:79)
  at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163)
  at org.apache.spark.sql.functions$.typedLit(functions.scala:127)
  at org.apache.spark.sql.functions$.lit(functions.scala:110)

2 Comments

No, it is giving an error. tag = ["oracle","java"]; tag2 = np.array(tag) works, but df.withColumn("tag", lit(tag2)) again throws an error.
Why are you using tag2 = np.array(tag)? You should use tag = np.array(("oracle","java")) as I mentioned.

Spark 3.4+

F.lit(["oracle", "java"])

Full example:

from pyspark.sql import functions as F

df = spark.range(5)
df = df.withColumn("tags", F.lit(["oracle", "java"]))

df.show()
# +---+--------------+
# | id|          tags|
# +---+--------------+
# |  0|[oracle, java]|
# |  1|[oracle, java]|
# |  2|[oracle, java]|
# |  3|[oracle, java]|
# |  4|[oracle, java]|
# +---+--------------+
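On Spark versions before 3.4, F.lit raises the "Unsupported literal type" error on a Python list, so the array column has to be built one literal per element. A minimal sketch (df and the column name are assumptions; the Spark calls are shown as comments since they need an active SparkSession):

```python
# Pre-3.4 workaround: turn each list element into its own literal column,
# then combine them with array().
tags = ["oracle", "java"]

# With an active SparkSession (not started here), this would be:
# from pyspark.sql import functions as F
# df = df.withColumn("tags", F.array(*[F.lit(t) for t in tags]))
```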



The answers above are not general, as they hard-code an array of literals. In a typical application you start with a plain Python list, and you cannot append that list to a PySpark DataFrame directly. Instead, iterate over the list, convert each item to a literal, and pass the resulting group of literals to PySpark's array function; that array can then be added as a new column to the DataFrame. Hope the code below helps:

from pyspark.sql.types import *
import pyspark.sql.functions as F
import numpy as np

arr1 = np.array(["oracle", "java"])
print(arr1)
df_test = spark.createDataFrame(data=[('finance', 10), ('marketing', 20), ('sales', 30), ('IT', 40)], schema=['deptname', 'deptid'])
df_test = df_test.withColumn("tags", F.array([F.lit(item) for item in arr1]))
df_test.show(truncate=False)

2 Comments

As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.
I updated the answer as requested. can you please validate it now.
