3

I am trying to create dataframe from any json string to dataframe. The json string is generally very deep and nested some times. The json string is like:

val json_string = """{
                   "Total Value": 3,
                   "Topic": "Example",
                   "values": [
                              {
                                "value1": "#example1",
                                "points": [
                                           [
                                           "123",
                                           "156"
                                          ]
                                    ],
                                "properties": {
                                 "date": "12-04-19",
                                 "model": "Model example 1"
                                    }
                                 },
                               {"value2": "#example2",
                                "points": [
                                           [
                                           "124",
                                           "157"
                                          ]
                                    ],
                                "properties": {
                                 "date": "12-05-19",
                                 "model": "Model example 2"
                                    }
                                 }
                              ]
                       }"""

The output which I am expecting is:

+-----------+-----------+----------+------------------+------------------+------------------------+-----------------------------+
|Total Value| Topic     |values 1 | values.points[0] | values.points[1] | values.properties.date | values.properties.model |
+-----------+-----------+----------+------------------+------------------+------------------------+-----------------------------+
| 3         | Example   | example1 | 123              | 156              | 12-04-19               |  Model Example 1         |
| 3         | Example   | example2 | 124              | 157              | 12-05-19               |    Model example 2         
+-----------+-----------+----------+------------------+------------------+------------------------+-----------------------------+

I am doing flattening but choosing some key in json for getting schema and then flattening but I don't want to flatten in this way. It should be independent of any key to be given and flatten accordingly as shown in output in above. Even after giving key that is values in this case, I am still getting 2 columns for same records due to the points is array so points[0] one columns and points[1] for different columns. My Scala spark code is:

val key = "values" //Ideally this should not be given in my case.
val jsonFullDFSchemaString = spark.read.json(json_location).select(col(key)).schema.json; // changing values to reportData
val jsonFullDFSchemaStructType = DataType.fromJson(jsonFullDFSchemaString).asInstanceOf[StructType]
val df = spark.read.schema(jsonFullDFSchemaStructType).json(json_location);

Now for flattening I am using:

 def flattenDataframe(df: DataFrame): DataFrame = {
    //getting all the fields from schema
    val fields = df.schema.fields
    val fieldNames = fields.map(x => x.name)
    //length shows the number of fields inside dataframe
    val length = fields.length
    for (i <- 0 to fields.length - 1) {
      val field = fields(i)
      val fieldtype = field.dataType
      val fieldName = field.name
      fieldtype match {
        case arrayType: ArrayType =>
          val fieldName1 = fieldName
          val fieldNamesExcludingArray = fieldNames.filter(_ != fieldName1)
          val fieldNamesAndExplode = fieldNamesExcludingArray ++ Array(s"explode_outer($fieldName1) as $fieldName1")
          //val fieldNamesToSelect = (fieldNamesExcludingArray ++ Array(s"$fieldName1.*"))
          val explodedDf = df.selectExpr(fieldNamesAndExplode: _*)
          return flattenDataframe(explodedDf)

        case structType: StructType =>
          val childFieldnames = structType.fieldNames.map(childname => fieldName + "." + childname)
          val newfieldNames = fieldNames.filter(_ != fieldName) ++ childFieldnames
          val renamedcols = newfieldNames.map(x => (col(x.toString()).as(x.toString().replace(".", "_").replace("$", "_").replace("__", "_").replace(" ", "").replace("-", ""))))
          val explodedf = df.select(renamedcols: _*)
          return flattenDataframe(explodedf)
        case _ =>
      }
    }
    df
  }

Now finally getting flatten dataframe from json:

val tableSchemaDF = flattenDataframe(df)
println(tableSchemaDF)

So ideally any json file should get flatten accordingly as I shown above without giving any root key and without creating 2 rows. Hope I have given enough details. Any help will be appreciated. Thanks.

Please Note: The Json Data is coming from API so it's not certain that the root key 'values' will be there or not. That's why I am not going with giving key for flattening.

3
  • 1
    have you validated your JSON? I think it's not in the proper formatted. Commented Dec 12, 2019 at 8:05
  • Thank you @baithmbarek for correction of my json string. Commented Dec 12, 2019 at 9:01
  • I think it will help @Mahesh Commented Dec 12, 2019 at 9:01

2 Answers 2

2

Here's a solution based on recursion, just a bit "hacky" at the end since you have specificities :

@tailrec
def recurs(df: DataFrame): DataFrame = {
  if(df.schema.fields.find(_.dataType match {
    case ArrayType(StructType(_),_) | StructType(_) => true
    case _ => false
  }).isEmpty) df
  else {
    val columns = df.schema.fields.map(f => f.dataType match {
      case _: ArrayType => explode(col(f.name)).as(f.name)
      case s: StructType => col(s"${f.name}.*")
      case _ => col(f.name)
    })
    recurs(df.select(columns:_*))
  }
}

val recursedDF = recurs(df)
val valuesColumns = recursedDF.columns.filter(_.startsWith("value"))
val projectionDF = recursedDF.withColumn("values", coalesce(valuesColumns.map(col):_*))
  .withColumn("point[0]", $"points".getItem(0))
  .withColumn("point[1]", $"points".getItem(1))
    .drop(valuesColumns :+ "points":_*)
projectionDF.show(false)

Output :

+-------+-----------+--------+---------------+---------+--------+--------+
|Topic  |Total Value|date    |model          |values   |point[0]|point[1]|
+-------+-----------+--------+---------------+---------+--------+--------+
|Example|3          |12-04-19|Model example 1|#example1|123     |156     |
|Example|3          |12-05-19|Model example 2|#example2|124     |157     |
+-------+-----------+--------+---------------+---------+--------+--------+
Sign up to request clarification or add additional context in comments.

6 Comments

I think it's still have "value" column name which is not desired
The problem is you want to "merge" different fields into a single field. ie: "value1" and "value2" should be used to populate a single column in your example. I don't see how we could be more generic for this.
Hi @baitmbarek, thanks a lot your code really helps. Just one request if I have to give column name as values.value1, values.points instead of values and points...what changes I have to make.
Hi @MohammadRijwan, can you please edit your question and add an extra paragraph to specify the output you're expecting? Would be glad to help but I'm not sure about the structure :)
Hi @baitmbarek, please see I have asked this in separate question: stackoverflow.com/questions/59873760/…
|
1

I would rather suggest going with the spark in-built function. You can take advantage of the explode of a spark function to achieve this.

here is the code snippet.

scala> val df = spark.read.json(Seq(json_string).toDS)
scala> var dfd = df.select($"topic",$"total value",explode($"values").as("values"))

Here I am choosing the column based on your needs. If no column is in the dataframe, please add based on your requirement.

scala> dfd.select($"topic",$"total value",$"values.points".getItem(0)(0).as("point_0"),$"values.points".getItem(0)(1).as("point_1"),$"values.properties.date".as("_date"),$"values.properties.model".as("_model")).show
+-------+-----------+-------+-------+--------+---------------+
|  topic|total value|point_0|point_1|   _date|         _model|
+-------+-----------+-------+-------+--------+---------------+
|Example|          3|    123|    156|12-04-19|Model example 1|
|Example|          3|    124|    157|12-05-19|Model example 2|
+-------+-----------+-------+-------+--------+---------------+

If you have a limited number of columns in JSON, this approach will give you an optimal result.

2 Comments

Actually the json is coming from API so it's not certain whether the key "values" will be same or not
Exactly just a sample.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.