2

Suppose I have two columns (StringType):

JSON 1 JSON 2
{"key1":"value1"} {"key2":"value2"}
{"key3":"value3"} {"key4":"value4"}

I need to concat these columns to have a LIST of JSON in the third column (also string), something like this:

JSON 1 JSON 2 JSON 3
{"key1":"value1"} {"key2":"value2"} [{"key1":"value1"},{"key2":"value2"}]
{"key3":"value3"} {"key4":"value4"} [{"key3":"value3"},{"key4":"value4"}]

Actually I solved the problem, but I don't think it's the best approach:

from pyspark.sql import DataFrame
from pyspark.sql.functions import col, concat, concat_ws, lit

def concat_jsons(df: DataFrame, columns: list):
    df = df.withColumn(
        'JSON 3', concat_ws(',', *columns)
    )

    return df.withColumn(
        'JSON 3',
        concat(lit('['), col('JSON 3'), lit(']'))
    )

Anyone has a better idea?

3
  • 1
    what is the data type of the columns? Commented Sep 25, 2023 at 16:18
  • Both columns JSON 1 and JSON 2 are string. Commented Sep 25, 2023 at 17:43
  • if by list you mean array, you could then just use the array function -- array('json 1', 'json 2') Commented Sep 26, 2023 at 7:29

1 Answer 1

2
data =[["""{"key1":"value1"}""", """{"key2":"value2"}"""], ["""{"key3":"value3"}""","""{"key4":"value4"}"""] ]

df = spark.createDataFrame(data).toDF('JSON 1','JSON 2')

df.show()
+-----------------+-----------------+
|           JSON 1|           JSON 2|
+-----------------------------------+
|{"key1":"value1"}|{"key2":"value2"}|
|{"key3":"value3"}|{"key4":"value4"}|
+-----------------+-----------------+

def concat_jsons(df: DataFrame, columns: list):
...   return df.withColumn('JSON 3',array(*columns))

concat_jsons(df, ['JSON 1','JSON 2']).show()

+-----------------+-----------------+--------------------+
|           JSON 1|           JSON 2|              JSON 3|
+-----------------+-----------------+--------------------+
|{"key1":"value1"}|{"key2":"value2"}|[{"key1":"value1"...|
|{"key3":"value3"}|{"key4":"value4"}|[{"key3":"value3"...|
+-----------------+-----------------+--------------------+

if you want "JSON 3" to be for StringType, cast it to string

def concat_jsons(df: DataFrame, columns: list):
...   return df.withColumn('JSON 3',array(*columns).cast('string'))
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.