Spark - convert JSON array object to array of string

Question

One of the columns in my DataFrame holds data in the following form:

[{"text":"Tea"},{"text":"GoldenGlobes"}]

I want to convert it to a plain array of strings:

["Tea", "GoldenGlobes"]

Would someone please let me know, how to do this?

You can use from_json() with a schema built from ArrayType(), and then select the field named text. See here for an example of how to use it.
– Kafels, Commented Jul 15, 2019 at 0:26

Kafels · Accepted Answer · 2019-07-15 17:13:49Z

2

See the example below without udf:

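import pyspark.sql.functions as f
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Reuse the active session or start one (instead of importing it from pyspark.shell)
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(values='[{"text":"Tea"},{"text":"GoldenGlobes"}]'),
    Row(values='[{"text":"GoldenGlobes"}]')
])

# The column holds a JSON array of objects, each with one string field "text"
schema = ArrayType(StructType([
    StructField('text', StringType())
]))

# Parse the JSON string, then pull "text" out of every struct in the array
df \
    .withColumn('array_of_str', f.from_json(f.col('values'), schema).text) \
    .show()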

Output:

+--------------------+-------------------+
|              values|       array_of_str|
+--------------------+-------------------+
|[{"text":"Tea"},{...|[Tea, GoldenGlobes]|
|[{"text":"GoldenG...|     [GoldenGlobes]|
+--------------------+-------------------+

answered Jul 15, 2019 at 17:13

Kafels


3 Comments

Rama Salahat Over a year ago

This raised the following error for me: Can only star expand struct data types. Attribute: "ArrayBuffer(jsonData)"; Do you have any idea how I can fix that?

Kafels Over a year ago

@RamaSalahat Without seeing your data and code it is hard to say. Can you post a question containing this information?

Rama Salahat Over a year ago

okay I'll do just that!

score 0 · Accepted Answer · 2019-07-14 21:24:18Z

0

If the type of your column is array then something like this should work (not tested):

from pyspark.sql import functions as F
from pyspark.sql import types as T

c = F.array([F.get_json_object(F.col("colname")[0], '$.text'),
             F.get_json_object(F.col("colname")[1], '$.text')])

df = df.withColumn("new_col", c)

Or, if the length is not fixed (I do not see a solution without a UDF):

@F.udf(T.ArrayType(T.StringType()))
def get_list(x):
    # x is one array value; collect the "text" field of each element
    o_list = []
    for elt in x:
        o_list.append(elt["text"])
    return o_list

df = df.withColumn("new_col", get_list("colname"))
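If the column holds the JSON as a plain string (as in the question's sample) rather than a typed array, the UDF body would have to parse it first. A minimal sketch of such a body in plain Python follows; extract_texts is a hypothetical name, and returning None for malformed input is an assumption about the desired behaviour, not something the question specifies.

```python
import json


def extract_texts(json_str):
    """Return the "text" values from a JSON array of objects, or None on bad input."""
    if json_str is None:
        return None
    try:
        items = json.loads(json_str)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(items, list):
        return None
    # Skip anything in the array that is not a JSON object
    return [item.get("text") for item in items if isinstance(item, dict)]


print(extract_texts('[{"text":"Tea"},{"text":"GoldenGlobes"}]'))
```

Wrapping it as F.udf(extract_texts, T.ArrayType(T.StringType())) should make it usable on a string column in the same way as get_list.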

edited Jul 14, 2019 at 21:24

answered Jul 14, 2019 at 21:12

user6473579

2 Comments

Gaurang Shah Over a year ago

There could be more than 2 elements in the array. The length is not fixed.

user6473579 Over a year ago

Then you will have to use a UDF. I'll edit the answer.

userab · Accepted Answer · 2020-06-17 09:51:10Z

Sharing the Java syntax:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;
import static org.apache.spark.sql.types.DataTypes.StringType;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

Dataset<Row> df = getYourDf();

// Each array element is a struct with a single nullable string field "text"
StructType structschema =
        DataTypes.createStructType(
                new StructField[] {
                        DataTypes.createStructField("text", StringType, true)
                });

// Array of those structs; elements may be null
ArrayType schema = DataTypes.createArrayType(structschema, true);

// Parse the JSON string column, then extract "text" from every element
df = df.withColumn("array_of_str", from_json(col("colname"), schema).getField("text"));

Harsha · Accepted Answer · 2025-03-20 10:33:31Z

0

I am facing the exact opposite problem. I wanted to convert the array to a string along with its keys and values, but unfortunately casting the array to a string removes the keys. This could be the solution in your case, though.

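# Assumes: from pyspark.sql.functions import col; from pyspark.sql.types import ArrayType
for field in df.schema.fields:
    if isinstance(field.dataType, ArrayType):
        df = df.withColumn(field.name, col(field.name).cast("string"))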

answered Mar 20 at 10:33

Harsha
