Spark (Scala): create a DataFrame from a JSON file and drop basic and nested elements

Question

I have a JSON file that contains JSON objects, one object per line. The objects have the following schema:

root
   |-- endtime: long (nullable = true)
   |-- result: array (nullable = true)
   |    |-- element: struct (containsNull = true)
   |    |    |-- hop: long (nullable = true)
   |    |    |-- result: array (nullable = true)
   |    |    |    |-- element: struct (containsNull = true)
   |    |    |    |    |-- from: string (nullable = true)
   |    |    |    |    |-- rtt: double (nullable = true)
   |    |    |    |    |-- size: long (nullable = true)
   |    |    |    |    |-- ttl: long (nullable = true)
   |    |    |    |    |-- x: string (nullable = true)

The question: how can I create a new DataFrame from the DataFrame that holds the input JSON data, with fields such as ttl and x removed?

   |    |    |    |    |-- ttl: long (nullable = true)
   |    |    |    |    |-- x: string (nullable = true)

Since I am new to Spark (Scala), I don't know what the possible approaches are.

Deleting endtime was simple:

val pathToTraceroutesExamples = getClass.getResource("/test/sample_1.json")
val df = spark.read.json(pathToTraceroutesExamples.getPath)

// Displays the content of the DataFrame to stdout
df.show()
df.printSchema()

val newDf = df.drop("endtime")

Kris · Accepted Answer · 2018-10-30 11:50:35Z

explode and drop will do the trick. First explode the first-level result, then explode the second-level result in the resulting DataFrame, and finally drop the columns.

For instance,

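import org.apache.spark.sql.functions.{col, explode}
val newDF = df
  .withColumn("result_exp", explode(col("result")))        // first level: one row per hop
  .withColumn("reply", explode(col("result_exp.result")))  // second level: one row per reply
  .select(col("*"), col("result_exp.hop"), col("reply.*")) // lift nested fields to top-level columns
  .drop("result", "result_exp", "reply", "ttl", "x")       // ttl and x can now be dropped by name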
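Flattening is not the only option. If the nested layout should be kept, a sketch along these lines removes ttl and x in place. It assumes Spark 3.2 or later, since functions.transform with a Scala lambda, Column.withField and Column.dropFields do not exist in the 2.x releases. A plain df.drop("ttl") would leave the nested fields untouched, because drop only matches top-level column names. The SparkSession setup and the inline sample record are made up for illustration, and the code is written in spark-shell (script) style.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, transform}

// Assumption: a local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("drop-nested").getOrCreate()
import spark.implicits._

// One JSON object per line, shaped like the schema in the question (values are made up).
val sample = Seq(
  """{"endtime":1,"result":[{"hop":1,"result":[{"from":"10.0.0.1","rtt":1.5,"size":28,"ttl":255,"x":"*"}]}]}"""
).toDS()
val df = spark.read.json(sample)

val cleaned = df
  .drop("endtime") // top-level column: drop works directly
  .withColumn("result", transform(col("result"), hop =>
    // rewrite each hop struct, replacing its inner array with a copy lacking ttl and x
    hop.withField("result",
      transform(hop.getField("result"), reply => reply.dropFields("ttl", "x")))))

cleaned.printSchema()
```

The outer transform visits each hop struct and the inner one visits each reply, so the result keeps one row per input line and only the two fields disappear.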

Hayat Bellafkih · Accepted Answer · 2018-10-31 12:08:32Z

@Kris's idea is right: explode, then drop. I found an example here.

I renamed the outer result attribute to hopDetails, because the nested array is also named result and the two would be confused when exploding:

Step 1: (Input)

 |-- timestamp: long (nullable = true)
 |-- hopDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- hop: long (nullable = true)
 |    |    |-- result: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- from: string (nullable = true)
 |    |    |    |    |-- rtt: double (nullable = true)
 |    |    |    |    |-- size: long (nullable = true)
 |    |    |    |    |-- ttl: long (nullable = true)

Step 2: Code:

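    import org.apache.spark.sql.functions.{col, explode}

    // Explode the array of structs once, so each hop stays paired with its own result
    // (exploding hopDetails.hop and hopDetails.result separately yields a cross product)
    val exploded_1 = renamed_newDF
      .withColumn("hopDetail", explode(col("hopDetails")))
      .withColumn("hop", col("hopDetail.hop"))
      .withColumn("result", col("hopDetail.result"))
      .drop("hopDetails", "hopDetail")
    exploded_1.printSchema()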

Output schema :

 |-- timestamp: long (nullable = true)
 |-- hop: long (nullable = true)
 |-- result: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- from: string (nullable = true)
 |    |    |-- rtt: double (nullable = true)
 |    |    |-- size: long (nullable = true)
 |    |    |-- ttl: long (nullable = true)

Step 3 :

Code :

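import org.apache.spark.sql.functions.{col, explode}

// Explode the array once, so from, rtt, size and ttl of one reply stay on the same row
// (one explode per field would combine values from different replies)
val exploded_2 = exploded_1
  .withColumn("reply", explode(col("result")))
  .withColumn("from", col("reply.from"))
  .withColumn("rtt", col("reply.rtt"))
  .withColumn("size", col("reply.size"))
  .withColumn("ttl", col("reply.ttl"))
  .drop("result", "reply")

exploded_2.printSchema()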

Schema :

root
   |-- af: long (nullable = true)
   |-- dst_addr: string (nullable = true)
   |-- from: string (nullable = true)
   |-- msm_id: long (nullable = true)
   |-- prb_id: long (nullable = true)
   |-- src_addr: string (nullable = true)
   |-- timestamp: long (nullable = true)
   |-- hop: long (nullable = true)
   |-- rtt: double (nullable = true)
   |-- size: long (nullable = true)
   |-- ttl: long (nullable = true)
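The flattened schema above still contains ttl, so one more drop is needed to get what the question asked for. The following end-to-end sketch runs the explode-then-drop approach on a made-up record with the hopDetails layout from Step 1. The SparkSession setup, the sample values and the intermediate column names hopDetail and reply are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Assumption: a local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("flatten-hops").getOrCreate()
import spark.implicits._

// Made-up record: hop 1 has two replies, hop 2 has one.
val sample = Seq(
  """{"timestamp":1,"hopDetails":[{"hop":1,"result":[{"from":"a","rtt":1.0,"size":28,"ttl":64},{"from":"a","rtt":1.2,"size":28,"ttl":64}]},{"hop":2,"result":[{"from":"b","rtt":2.5,"size":28,"ttl":63}]}]}"""
).toDS()
val input = spark.read.json(sample)

val flat = input
  .withColumn("hopDetail", explode(col("hopDetails")))   // one row per hop
  .withColumn("reply", explode(col("hopDetail.result"))) // one row per reply within that hop
  .select(col("timestamp"), col("hopDetail.hop").as("hop"), col("reply.*"))
  .drop("ttl")                                           // now a top-level column, so drop applies

flat.show() // three rows, one per reply
```

Because each array of structs is exploded as a whole, the row count equals the number of replies rather than a product of array lengths.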
