1

I have a PySpark dataframe where columns have JSON string values like this:

col1                          col2
{"d1":"2343","v1":"3434"} {"id1":"123"}
{"d1":"2344","v1":"3435"} {"id1":"124"}

I want to update "col1" JSON string values with "col2" JSON string values to get this:

col1                                     col2
{"d1":"2343","v1":"3434","id1":"123"}    {"id1":"123"}
{"d1":"2344","v1":"3435","id1":"124"}    {"id1":"124"}

How to do this in PySpark?

0

1 Answer 1

1

Since you're dealing with string type columns, you can remove the last } from "col1", remove the first { from "col2" and join the strings together with comma , as delimiter.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('{"d1":"2343","v1":"3434"}', '{"id1":"123"}'),
     ('{"d1":"2344","v1":"3435"}', '{"id1":"124"}')],
    ["col1", "col2"])

Script:

df = df.withColumn(
    "col1",
    F.concat_ws(
        ",",
        F.regexp_replace("col1", r"}$", ""),
        F.regexp_replace("col2", r"^\{", "")
    )
)

df.show(truncate=0)
# +-------------------------------------+-------------+
# |col1                                 |col2         |
# +-------------------------------------+-------------+
# |{"d1":"2343","v1":"3434","id1":"123"}|{"id1":"123"}|
# |{"d1":"2344","v1":"3435","id1":"124"}|{"id1":"124"}|
# +-------------------------------------+-------------+
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.