I have some data that needs to be written as a JSON string after some transformations in a Spark (Scala) job. I'm using the to_json function together with the struct and/or array functions to build the final JSON that is required.

One piece of the JSON looks like this:

"field":[
    "foo",
    {
        "inner_field":"bar"
    }
]

I'm not an expert in JSON, so I don't know whether this structure is common; all I know is that it is valid JSON. I'm having trouble creating a DataFrame column with this format, and I'd like to know the best way to create this kind of column.

Thanks in advance

1 Answer

If you have a DataFrame with several columns that you want to turn into a single JSON string column, you can use the to_json and struct functions. Something like this:

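import org.apache.spark.sql.functions.{struct, to_json}
// toDF and the $"..." syntax need spark.implicits._ (spark-shell imports it automatically)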

val df = Seq(
  (1, "string1", Seq("string2", "string3")),
  (2, "string4", Seq("string5", "string6"))
  ).toDF("colA", "colB", "colC")

df.show
+----+-------+------------------+
|colA|   colB|              colC|
+----+-------+------------------+
|   1|string1|[string2, string3]|
|   2|string4|[string5, string6]|
+----+-------+------------------+

val newDf = df.withColumn("jsonString", to_json(struct($"colA", $"colB", $"colC")))

newDf.show(false)
+----+-------+------------------+--------------------------------------------------------+
|colA|colB   |colC              |jsonString                                              |
+----+-------+------------------+--------------------------------------------------------+
|1   |string1|[string2, string3]|{"colA":1,"colB":"string1","colC":["string2","string3"]}|
|2   |string4|[string5, string6]|{"colA":2,"colB":"string4","colC":["string5","string6"]}|
+----+-------+------------------+--------------------------------------------------------+

struct combines multiple columns into a single StructType column, and to_json serializes that column into a JSON string.
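The array in the question mixes a plain string with an object. A Spark ArrayType column needs a single element type, so array("foo", struct(...)) will not resolve, and to_json cannot emit that shape directly. One possible workaround, shown here only as a sketch, is to render the mixed array in a UDF with json4s and splice the result into the surrounding JSON with concat. The column names first and inner, the object name MixedJsonArray, and the helper mixedArrayJson are made up for illustration. The sketch also assumes json4s is on the classpath (it normally comes in with Spark) and that both columns are non-null.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat, lit, udf}
import org.json4s.{JArray, JField, JObject, JString}
import org.json4s.jackson.JsonMethods.{compact, render}

// Hypothetical example object; all names below are illustrative only.
object MixedJsonArray {

  // Builds the JSON text for an array holding a string and an object.
  // json4s takes care of quoting and escaping the values.
  // Assumption: neither argument is null.
  def mixedArrayJson(first: String, inner: String): String =
    compact(render(JArray(List(
      JString(first),
      JObject(List(JField("inner_field", JString(inner))))
    ))))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("mixed-json-array")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input: one column per value that ends up in the array.
    val df = Seq(("foo", "bar")).toDF("first", "inner")

    val toMixedArray = udf(mixedArrayJson _)

    // The UDF result is already JSON text, so it is concatenated into the
    // outer object as-is. Passing it through to_json would escape it as an
    // ordinary string value instead.
    val result = df.withColumn(
      "jsonString",
      concat(lit("""{"field":"""), toMixedArray($"first", $"inner"), lit("}"))
    )

    result.show(false)
    spark.stop()
  }
}
```

For the sample row, jsonString should come out as {"field":["foo",{"inner_field":"bar"}]}. The trade-off is that the outer object is assembled by string concatenation rather than by to_json, so any other fields have to be spliced in the same way or produced inside the UDF.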

Hope this helps!

1 Comment

Hi. We are using that approach now, but we still have issues when trying to get the output that I put in the question. Can you try to give an example where you take some columns and get that specific result?
