
As part of my dataframe, one of the columns has data in the following manner:

[{"text":"Tea"},{"text":"GoldenGlobes"}]

And I want to convert that to just an array of strings:

["Tea", "GoldenGlobes"]

Would someone please let me know how to do this?

  • You can use from_json(), creating a schema with ArrayType(), and select the field named text. See here for an example of how to use it. Commented Jul 15, 2019 at 0:26
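(The example the comment links to is not preserved here. Below is a minimal sketch of what it suggests, assuming the column is named values as in the first answer, and using schema_of_json to infer the schema from a sample string; both names are assumptions on my part, not from the original link.)

import pyspark.sql.functions as f

# Infer array<struct<text:string>> from a sample of the JSON, parse the
# column, then pull the "text" field out of every struct in the array.
sample = '[{"text":"Tea"}]'
df = df.withColumn(
    'array_of_str',
    f.from_json(f.col('values'), f.schema_of_json(f.lit(sample))).getField('text')
)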

4 Answers


See the example below, without a UDF:

import pyspark.sql.functions as f
from pyspark.sql import Row, SparkSession  # Row lives in pyspark.sql, not top-level pyspark
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(values='[{"text":"Tea"},{"text":"GoldenGlobes"}]'),
    Row(values='[{"text":"GoldenGlobes"}]')
])

schema = ArrayType(StructType([
    StructField('text', StringType())
]))

# from_json parses the string into array<struct<text:string>>; .text then
# collects that struct field across the array, yielding array<string>
df \
    .withColumn('array_of_str', f.from_json(f.col('values'), schema).text) \
    .show()

Output:

+--------------------+-------------------+
|              values|       array_of_str|
+--------------------+-------------------+
|[{"text":"Tea"},{...|[Tea, GoldenGlobes]|
|[{"text":"GoldenG...|     [GoldenGlobes]|
+--------------------+-------------------+
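As a side note, in recent Spark versions the same schema can be passed to from_json() as a DDL string, which avoids building the StructType by hand. This is just a more compact spelling of the example above, not a different method:

# Equivalent, with the schema written as a DDL string
df \
    .withColumn('array_of_str',
                f.from_json(f.col('values'), 'array<struct<text:string>>').text) \
    .show()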

3 Comments

This raised an error for me: Can only star expand struct data types. Attribute: "ArrayBuffer(jsonData)". Do you have any idea how I can fix that?
@RamaSalahat Without seeing your data and code it is hard to tell; can you post a question containing this information?
Okay, I'll do just that!

If the type of your column is array, then something like this should work (not tested):

from pyspark.sql import functions as F
from pyspark.sql import types as T

# Works when the column is an array with exactly two JSON elements
c = F.array(F.get_json_object(F.col("colname")[0], '$.text'),
            F.get_json_object(F.col("colname")[1], '$.text'))

df = df.withColumn("new_col", c)

Or, if the length is not fixed (I do not see a solution without a UDF):

@F.udf(T.ArrayType(T.StringType()))  # ArrayType needs an element type
def get_list(x):
    o_list = []
    for elt in x:
        o_list.append(elt["text"])
    return o_list

df = df.withColumn("new_col", get_list("colname"))
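On Spark 2.4+ there is also a way around the UDF for the variable-length case: the transform higher-order function. A sketch, assuming colname is already an array of structs with a text field:

# transform() maps over the array natively, without a Python UDF
df = df.withColumn("new_col", F.expr("transform(colname, x -> x.text)"))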

2 Comments

There could be more than 2 elements in the array; the length is not fixed.
Then you will have to use a UDF; I'll edit the answer.

Sharing the Java syntax:

import static org.apache.spark.sql.functions.from_json;
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.types.DataTypes.StringType;

Dataset<Row> df = getYourDf();

// Schema for one element: a struct with a single nullable string field "text"
StructType structschema =
        DataTypes.createStructType(
                new StructField[] {
                        DataTypes.createStructField("text", StringType, true)
                });

// The column holds an array of such structs
ArrayType schema = new ArrayType(structschema, true);

df = df.withColumn("array_of_str", from_json(col("colname"), schema).getField("text"));

Comments


I am facing the exact opposite problem: I wanted to convert the array to a string along with the keys and values, but unfortunately converting the array into a string is removing the keys. This could be the solution in your case, though.

# Assumes a loop over df.schema.fields, with col and ArrayType imported
for field in df.schema.fields:
    if isinstance(field.dataType, ArrayType):
        df = df.withColumn(field.name, col(field.name).cast("string"))
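If the goal is to keep the keys while stringifying (the problem this answer describes), to_json may be closer to what is wanted than a plain cast; a sketch under the same assumed loop:

from pyspark.sql.functions import to_json

# to_json serializes back to a JSON string and keeps the key names,
# e.g. '[{"text":"Tea"}]', whereas cast("string") drops them
for field in df.schema.fields:
    if isinstance(field.dataType, ArrayType):
        df = df.withColumn(field.name, to_json(col(field.name)))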

Comments
