Update:
Actually, you can simply use from_json to parse the Arr_of_Str column as an array of strings:
from pyspark.sql import functions as F
df2 = df.withColumn(
    "Arr_of_Str",
    F.from_json(F.col("Arr_of_Str"), "array<string>")
)
df2.show(truncate=False)
#+---+--------------+
#|ID |Arr_of_Str |
#+---+--------------+
#| 1 |[ABC DEF] |
#| 2 |[PQR, ABC DEF]|
#+---+--------------+
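This works because each value in the Arr_of_Str column is already a valid JSON array literal. As a quick sanity check outside Spark (a hedged illustration, not part of the original answer), plain Python's json.loads parses the same strings the way from_json does:

```python
import json

# Sample column values as they appear in the CSV (valid JSON array literals)
raw_values = ['["ABC DEF"]', '["PQR", "ABC DEF"]']

# json.loads parses each string into a Python list of strings,
# analogous to from_json with the schema "array<string>"
parsed = [json.loads(v) for v in raw_values]
print(parsed)  # [['ABC DEF'], ['PQR', 'ABC DEF']]
```

If json.loads fails on a sample value, from_json will return null for that row, so this check is a cheap way to verify the column format first.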
Old answer:
You can't do that when reading the data, as there is no support for complex data structures in CSV. You'll have to do the transformation after loading the DataFrame.
Just remove the array square brackets from the string and split it to get an array column.
from pyspark.sql.functions import col, split, regexp_replace

df2 = df.withColumn("Arr_of_Str", split(regexp_replace(col("Arr_of_Str"), r"[\[\]]", ""), ","))
df2.show()
+---+-------------------+
| ID| Arr_of_Str|
+---+-------------------+
| 1| ["ABC DEF"]|
| 2|["PQR", "ABC DEF"]|
+---+-------------------+
df2.printSchema()
root
|-- ID: string (nullable = true)
|-- Arr_of_Str: array (nullable = true)
| |-- element: string (containsNull = true)
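Note that with this approach the elements still contain the literal double quotes (visible in the output above as "ABC DEF"), and elements after a comma keep a leading space. Since regexp_replace and split mirror Python's re.sub and str.split, here is a pure-Python sketch of the same logic extended to also strip quotes and surrounding whitespace (the function name clean_split is hypothetical, for illustration only):

```python
import re

def clean_split(s):
    # Strip brackets AND double quotes (character class [\[\]"]),
    # then split on commas and trim whitespace around each element
    return [part.strip() for part in re.sub(r'[\[\]"]', '', s).split(',')]

print(clean_split('["PQR", "ABC DEF"]'))  # ['PQR', 'ABC DEF']
```

The equivalent Spark expression would simply widen the character class the same way, e.g. regexp_replace(col("Arr_of_Str"), r'[\[\]"]', "") before the split.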