Flatten any nested json string and convert to dataframe using spark scala

Question

I am trying to create dataframe from any json string to dataframe. The json string is generally very deep and nested some times. The json string is like:

val json_string = """{
                   "Total Value": 3,
                   "Topic": "Example",
                   "values": [
                              {
                                "value1": "#example1",
                                "points": [
                                           [
                                           "123",
                                           "156"
                                          ]
                                    ],
                                "properties": {
                                 "date": "12-04-19",
                                 "model": "Model example 1"
                                    }
                                 },
                               {"value2": "#example2",
                                "points": [
                                           [
                                           "124",
                                           "157"
                                          ]
                                    ],
                                "properties": {
                                 "date": "12-05-19",
                                 "model": "Model example 2"
                                    }
                                 }
                              ]
                       }"""

The output which I am expecting is:

+-----------+-----------+----------+------------------+------------------+------------------------+-----------------------------+
|Total Value| Topic     |values 1 | values.points[0] | values.points[1] | values.properties.date | values.properties.model |
+-----------+-----------+----------+------------------+------------------+------------------------+-----------------------------+
| 3         | Example   | example1 | 123              | 156              | 12-04-19               |  Model Example 1         |
| 3         | Example   | example2 | 124              | 157              | 12-05-19               |    Model example 2         
+-----------+-----------+----------+------------------+------------------+------------------------+-----------------------------+

I am doing flattening but choosing some key in json for getting schema and then flattening but I don't want to flatten in this way. It should be independent of any key to be given and flatten accordingly as shown in output in above. Even after giving key that is values in this case, I am still getting 2 columns for same records due to the points is array so points[0] one columns and points[1] for different columns. My Scala spark code is:

val key = "values" //Ideally this should not be given in my case.
val jsonFullDFSchemaString = spark.read.json(json_location).select(col(key)).schema.json; // changing values to reportData
val jsonFullDFSchemaStructType = DataType.fromJson(jsonFullDFSchemaString).asInstanceOf[StructType]
val df = spark.read.schema(jsonFullDFSchemaStructType).json(json_location);

Now for flattening I am using:

 def flattenDataframe(df: DataFrame): DataFrame = {
    //getting all the fields from schema
    val fields = df.schema.fields
    val fieldNames = fields.map(x => x.name)
    //length shows the number of fields inside dataframe
    val length = fields.length
    for (i <- 0 to fields.length - 1) {
      val field = fields(i)
      val fieldtype = field.dataType
      val fieldName = field.name
      fieldtype match {
        case arrayType: ArrayType =>
          val fieldName1 = fieldName
          val fieldNamesExcludingArray = fieldNames.filter(_ != fieldName1)
          val fieldNamesAndExplode = fieldNamesExcludingArray ++ Array(s"explode_outer($fieldName1) as $fieldName1")
          //val fieldNamesToSelect = (fieldNamesExcludingArray ++ Array(s"$fieldName1.*"))
          val explodedDf = df.selectExpr(fieldNamesAndExplode: _*)
          return flattenDataframe(explodedDf)

        case structType: StructType =>
          val childFieldnames = structType.fieldNames.map(childname => fieldName + "." + childname)
          val newfieldNames = fieldNames.filter(_ != fieldName) ++ childFieldnames
          val renamedcols = newfieldNames.map(x => (col(x.toString()).as(x.toString().replace(".", "_").replace("$", "_").replace("__", "_").replace(" ", "").replace("-", ""))))
          val explodedf = df.select(renamedcols: _*)
          return flattenDataframe(explodedf)
        case _ =>
      }
    }
    df
  }

Now finally getting flatten dataframe from json:

val tableSchemaDF = flattenDataframe(df)
println(tableSchemaDF)

So ideally any json file should get flatten accordingly as I shown above without giving any root key and without creating 2 rows. Hope I have given enough details. Any help will be appreciated. Thanks.

Please Note: The Json Data is coming from API so it's not certain that the root key 'values' will be there or not. That's why I am not going with giving key for flattening.

have you validated your JSON? I think it's not in the proper formatted. — Mahesh Gupta
– Mahesh Gupta, Commented Dec 12, 2019 at 8:05

baitmbarek · Accepted Answer · 2019-12-12 09:43:40Z

2

Here's a solution based on recursion, just a bit "hacky" at the end since you have specificities :

@tailrec
def recurs(df: DataFrame): DataFrame = {
  if(df.schema.fields.find(_.dataType match {
    case ArrayType(StructType(_),_) | StructType(_) => true
    case _ => false
  }).isEmpty) df
  else {
    val columns = df.schema.fields.map(f => f.dataType match {
      case _: ArrayType => explode(col(f.name)).as(f.name)
      case s: StructType => col(s"${f.name}.*")
      case _ => col(f.name)
    })
    recurs(df.select(columns:_*))
  }
}

val recursedDF = recurs(df)
val valuesColumns = recursedDF.columns.filter(_.startsWith("value"))
val projectionDF = recursedDF.withColumn("values", coalesce(valuesColumns.map(col):_*))
  .withColumn("point[0]", $"points".getItem(0))
  .withColumn("point[1]", $"points".getItem(1))
    .drop(valuesColumns :+ "points":_*)
projectionDF.show(false)

Output :

+-------+-----------+--------+---------------+---------+--------+--------+
|Topic  |Total Value|date    |model          |values   |point[0]|point[1]|
+-------+-----------+--------+---------------+---------+--------+--------+
|Example|3          |12-04-19|Model example 1|#example1|123     |156     |
|Example|3          |12-05-19|Model example 2|#example2|124     |157     |
+-------+-----------+--------+---------------+---------+--------+--------+

answered Dec 12, 2019 at 9:43

baitmbarek

2,5264 gold badges20 silver badges27 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Mohammad Rijwan Over a year ago

I think it's still have "value" column name which is not desired

baitmbarek Over a year ago

The problem is you want to "merge" different fields into a single field. ie: "value1" and "value2" should be used to populate a single column in your example. I don't see how we could be more generic for this.

Mohammad Rijwan Over a year ago

Hi @baitmbarek, thanks a lot your code really helps. Just one request if I have to give column name as values.value1, values.points instead of values and points...what changes I have to make.

baitmbarek Over a year ago

Hi @MohammadRijwan, can you please edit your question and add an extra paragraph to specify the output you're expecting? Would be glad to help but I'm not sure about the structure :)

Mohammad Rijwan Over a year ago

Hi @baitmbarek, please see I have asked this in separate question: stackoverflow.com/questions/59873760/…

|

Mohammad Rijwan · Accepted Answer · 2019-12-12 12:30:56Z

1

I would rather suggest going with the spark in-built function. You can take advantage of the explode of a spark function to achieve this.

here is the code snippet.

scala> val df = spark.read.json(Seq(json_string).toDS)
scala> var dfd = df.select($"topic",$"total value",explode($"values").as("values"))

Here I am choosing the column based on your needs. If no column is in the dataframe, please add based on your requirement.

scala> dfd.select($"topic",$"total value",$"values.points".getItem(0)(0).as("point_0"),$"values.points".getItem(0)(1).as("point_1"),$"values.properties.date".as("_date"),$"values.properties.model".as("_model")).show
+-------+-----------+-------+-------+--------+---------------+
|  topic|total value|point_0|point_1|   _date|         _model|
+-------+-----------+-------+-------+--------+---------------+
|Example|          3|    123|    156|12-04-19|Model example 1|
|Example|          3|    124|    157|12-05-19|Model example 2|
+-------+-----------+-------+-------+--------+---------------+

If you have a limited number of columns in JSON, this approach will give you an optimal result.

edited Dec 12, 2019 at 12:30

Mohammad Rijwan

3574 silver badges20 bronze badges

answered Dec 12, 2019 at 10:05

Mahesh Gupta

1,90214 silver badges17 bronze badges

2 Comments

Mohammad Rijwan Over a year ago

Actually the json is coming from API so it's not certain whether the key "values" will be same or not

MD Rijwan Over a year ago

Exactly just a sample.

Collectives™ on Stack Overflow

Flatten any nested json string and convert to dataframe using spark scala

2 Answers 2

6 Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

6 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related