I have a JSON file that contains JSON objects, one object per line. The objects have the following schema:

root
   |-- endtime: long (nullable = true)
   |-- result: array (nullable = true)
   |    |-- element: struct (containsNull = true)
   |    |    |-- hop: long (nullable = true)
   |    |    |-- result: array (nullable = true)
   |    |    |    |-- element: struct (containsNull = true)
   |    |    |    |    |-- from: string (nullable = true)
   |    |    |    |    |-- rtt: double (nullable = true)
   |    |    |    |    |-- size: long (nullable = true)
   |    |    |    |    |-- ttl: long (nullable = true)
   |    |    |    |    |-- x: string (nullable = true)

The question: how can I create a new DataFrame from the DataFrame that holds the JSON file's data, with the ttl and x fields removed?

   |    |    |    |    |-- ttl: long (nullable = true)
   |    |    |    |    |-- x: string (nullable = true)

I am new to Spark (Scala), so I don't know what the possible approaches are.

Dropping endtime was simple:

val pathToTraceroutesExamples = getClass.getResource("/test/sample_1.json")
val df = spark.read.json(pathToTraceroutesExamples.getPath)

// Displays the content of the DataFrame to stdout
df.show()
df.printSchema()

val newDf = df.drop("endtime")

2 Answers

explode and drop will do the trick. drop only removes top-level columns, so the nested fields have to be flattened first: explode the outer result array, then explode the inner result array of the resulting DataFrame. Finally, drop the columns.

For instance,

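import org.apache.spark.sql.functions.{col, explode}
val newDF = df
  .select(col("endtime"), explode(col("result")).alias("hop_exp"))   // one row per hop
  .select(col("endtime"), col("hop_exp.hop"), explode(col("hop_exp.result")).alias("reply"))   // one row per reply
  .select(col("endtime"), col("hop"), col("reply.*"))   // lift the struct fields to top-level columns
  .drop("ttl", "x")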
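
To see why the explode step is needed before the drop, here is a minimal self-contained sketch. It assumes a local SparkSession and a made-up one-record sample that follows the question's schema; the object name, helper names and sample values are hypothetical. Calling drop("ttl") on the original DataFrame is a no-op because there is no top-level ttl column, whereas after flattening both array levels the same call removes the field.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeThenDrop {
  // Hypothetical one-record sample that follows the question's schema.
  val sampleJson: String =
    """{"endtime":1,"result":[{"hop":1,"result":[{"from":"10.0.0.1","rtt":1.5,"size":28,"ttl":64,"x":"*"}]}]}"""

  // Builds a DataFrame from the in-memory sample instead of a file.
  def load(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.read.json(Seq(sampleJson).toDS())
  }

  // Flattens both array levels so ttl and x become top-level columns, then drops them.
  def flattenAndDrop(df: DataFrame): DataFrame =
    df.select(col("endtime"), explode(col("result")).alias("hop_exp"))
      .select(col("endtime"), col("hop_exp.hop"), explode(col("hop_exp.result")).alias("reply"))
      .select(col("endtime"), col("hop"), col("reply.*"))
      .drop("ttl", "x")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("explode-then-drop").getOrCreate()
    val df = load(spark)
    // drop matches top-level column names only, so the nested ttl is still there.
    df.drop("ttl").printSchema()
    // After flattening, the drop takes effect.
    flattenAndDrop(df).printSchema()
    spark.stop()
  }
}
```

Note that explode discards rows whose array is null or empty; explode_outer keeps them if that matters for your data.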

@Kris's idea is right: explode, then drop. I found an example here.

I renamed the outer result attribute to hopDetails, because the nested array is also called result and the duplicate name is confusing when exploding:

Step 1: (Input)

 |-- timestamp: long (nullable = true)
 |-- hopDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- hop: long (nullable = true)
 |    |    |-- result: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- from: string (nullable = true)
 |    |    |    |    |-- rtt: double (nullable = true)
 |    |    |    |    |-- size: long (nullable = true)
 |    |    |    |    |-- ttl: long (nullable = true)

Step 2: Code:

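    import org.apache.spark.sql.functions.{col, explode}
    // Explode hopDetails once so each hop stays paired with its own result array;
    // exploding hopDetails.hop and hopDetails.result separately yields a cross product.
    // "hopDetail" is a temporary column name used only for this step.
    val exploded_1 = renamed_newDF
             .withColumn("hopDetail", explode(col("hopDetails")))
             .withColumn("hop", col("hopDetail.hop"))
             .withColumn("result", col("hopDetail.result"))
             .drop("hopDetails", "hopDetail")
    exploded_1.printSchema()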

Output schema :

 |-- timestamp: long (nullable = true)
 |-- hop: long (nullable = true)
 |-- result: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- from: string (nullable = true)
 |    |    |-- rtt: double (nullable = true)
 |    |    |-- size: long (nullable = true)
 |    |    |-- ttl: long (nullable = true)

Step 3 :

Code :

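import org.apache.spark.sql.functions.{col, explode}
// Explode result once and read the fields from the same element; one explode
// per field would produce every combination of from/rtt/size/ttl.
// "reply" is a temporary column name used only for this step.
val exploded_2 = exploded_1
  .withColumn("reply", explode(col("result")))
  .withColumn("from", col("reply.from"))
  .withColumn("rtt", col("reply.rtt"))
  .withColumn("size", col("reply.size"))
  .withColumn("ttl", col("reply.ttl"))
  .drop("result", "reply")

exploded_2.printSchema()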

Schema :

    root
   |-- af: long (nullable = true)
   |-- dst_addr: string (nullable = true)
   |-- from: string (nullable = true)
   |-- msm_id: long (nullable = true)
   |-- prb_id: long (nullable = true)
   |-- src_addr: string (nullable = true)
   |-- timestamp: long (nullable = true)
   |-- hop: long (nullable = true)
   |-- rtt: double (nullable = true)
   |-- size: long (nullable = true)
   |-- ttl: long (nullable = true)
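
The flattened schema above still lists ttl, and the nesting is gone. If the nested layout has to stay intact, one possible alternative (not what either answer uses) is to rebuild the structs with Spark SQL's transform and named_struct functions, keeping only the wanted fields. This is a sketch that assumes Spark 2.4 or later and the original attribute names from the question (endtime, result); the object name, helper names and sample record are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object PruneNested {
  // Hypothetical one-record sample that follows the question's schema.
  val sampleJson: String =
    """{"endtime":1,"result":[{"hop":1,"result":[{"from":"10.0.0.1","rtt":1.5,"size":28,"ttl":64,"x":"*"}]}]}"""

  // Builds a DataFrame from the in-memory sample instead of a file.
  def load(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.read.json(Seq(sampleJson).toDS())
  }

  // Rebuilds every hop and every reply struct with only the fields to keep.
  // `from` is backquoted because it is a SQL keyword.
  def prune(df: DataFrame): DataFrame =
    df.withColumn(
      "result",
      expr(
        """transform(result, h -> named_struct(
          |  'hop', h.hop,
          |  'result', transform(h.result, r -> named_struct(
          |    'from', r.`from`, 'rtt', r.rtt, 'size', r.size))))""".stripMargin
      )
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("prune-nested").getOrCreate()
    // Same nesting as the input, minus ttl and x.
    prune(load(spark)).printSchema()
    spark.stop()
  }
}
```

The trade-off is that the kept fields are listed explicitly, so a field added to the input later is silently left out until the expression is updated.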
