
I have the JSON structure below in my DataFrame as a body attribute. I would like to drop multiple columns/attributes from the content based on a provided list. How can I do this in Scala?

Note that the list of attributes to drop is variable.

For example:

List of columns to drop: List(alias, firstName, lastName)

Input:

  "Content":{
     "alias":"Jon",
     "firstName":"Jonathan",
     "lastName":"Mathew",
     "displayName":"Jonathan Mathew",
     "createdDate":"2021-08-10T13:06:35.866Z",
     "updatedDate":"2021-08-10T13:06:35.866Z",
     "isDeleted":false,
     "address":"xx street",
     "phone":"xxx90"
  }

Output:

  "Content":{
     "displayName":"Jonathan Mathew",
     "createdDate":"2021-08-10T13:06:35.866Z",
     "updatedDate":"2021-08-10T13:06:35.866Z",
     "isDeleted":false,
     "address":"xx street",
     "phone":"xxx90"
  }

2 Answers


You can get the list of attributes from the DataFrame schema, then replace the Content column with a struct built from every attribute except those in your list of columns to drop.

Here's a complete working example:

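import org.apache.spark.sql.functions.{col, struct}
import spark.implicits._ // needed for .toDS; assumes a SparkSession named spark, as in spark-shell

val jsonStr = """{"id": 1,"Content":{"alias":"Jon","firstName":"Jonathan","lastName":"Mathew","displayName":"Jonathan Mathew","createdDate":"2021-08-10T13:06:35.866Z","updatedDate":"2021-08-10T13:06:35.866Z","isDeleted":false,"address":"xx street","phone":"xxx90"}}"""

val df = spark.read.json(Seq(jsonStr).toDS)

// Variable list of nested attributes to remove from Content
val attrToDrop = Seq("alias", "firstName", "lastName")

// Names of all fields currently inside the Content struct
val contentAttrList = df.select("Content.*").columns

// Rebuild Content from the fields that are not in attrToDrop
val df2 = df.withColumn(
  "Content",
  struct(
    contentAttrList
      .filter(!attrToDrop.contains(_))
      .map(c => col(s"Content.$c")): _*
  )
)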

df2.printSchema
//root
// |-- Content: struct (nullable = false)
// |    |-- address: string (nullable = true)
// |    |-- createdDate: string (nullable = true)
// |    |-- displayName: string (nullable = true)
// |    |-- isDeleted: boolean (nullable = true)
// |    |-- phone: string (nullable = true)
// |    |-- updatedDate: string (nullable = true)
// |-- id: long (nullable = true)
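
On Spark 3.1 or later, a shorter alternative may be Column.dropFields, which removes named fields from a struct column. The following is a minimal sketch under that version assumption, not part of the original answer. The object name DropNestedFields and the helper dropNested are hypothetical names introduced for illustration, and the empty-list guard is a defensive assumption because the list of attributes is variable.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, for illustration only (assumes Spark 3.1+).
object DropNestedFields {

  // Removes the given fields from the struct column `structCol`.
  // Returns the input unchanged when the list is empty (defensive guard).
  def dropNested(df: DataFrame, structCol: String, fields: Seq[String]): DataFrame =
    if (fields.isEmpty) df
    else df.withColumn(structCol, col(structCol).dropFields(fields: _*))

  def main(args: Array[String]): Unit = {
    // Local session, assumed here so the sketch is self-contained
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("drop-nested-fields")
      .getOrCreate()
    import spark.implicits._

    val jsonStr =
      """{"id": 1,"Content":{"alias":"Jon","firstName":"Jonathan","lastName":"Mathew","displayName":"Jonathan Mathew"}}"""
    val df = spark.read.json(Seq(jsonStr).toDS)

    val attrToDrop = Seq("alias", "firstName", "lastName")
    val df2 = dropNested(df, "Content", attrToDrop)

    df2.printSchema()
    spark.stop()
  }
}
```

Compared with rebuilding the struct by hand, this avoids listing the remaining fields, but it is only available from Spark 3.1 onward, so the struct-based approach above remains the option for older versions.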


You can use drop to drop multiple columns at once:

val newDataframe = oldDataframe.drop("alias", "firstName", "lastName")

Documentation:

/**
 * Returns a new Dataset with columns dropped.
 * This is a no-op if schema doesn't contain column name(s).
 *
 * This method can only be used to drop top level columns. the colName string is treated literally
 * without further interpretation.
 *
 * @group untypedrel
 * @since 2.0.0
 */
@scala.annotation.varargs
def drop(colNames: String*): DataFrame
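
As the quoted documentation notes, drop only acts on top-level columns and is a no-op for names the schema does not contain. The sketch below illustrates what that should mean for the question's data: passing the nested attribute names leaves the DataFrame as it was, while a top-level name such as id is removed. The object name TopLevelDropDemo and the helper dropTopLevel are hypothetical, and the local SparkSession is an assumption made to keep the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical demo object, for illustration only.
object TopLevelDropDemo {

  // Passes a variable list of names to drop using varargs expansion
  def dropTopLevel(df: DataFrame, names: Seq[String]): DataFrame =
    df.drop(names: _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("top-level-drop-demo")
      .getOrCreate()
    import spark.implicits._

    val jsonStr = """{"id": 1,"Content":{"alias":"Jon","displayName":"Jonathan Mathew"}}"""
    val df = spark.read.json(Seq(jsonStr).toDS)

    // These names exist only inside the Content struct, not at the top level
    val nestedNames = Seq("alias", "firstName", "lastName")
    val unchanged = dropTopLevel(df, nestedNames)
    println(unchanged.columns.mkString(", "))
    println(unchanged.select("Content.*").columns.mkString(", "))

    // A genuine top-level column is dropped as expected
    val withoutId = dropTopLevel(df, Seq("id"))
    println(withoutId.columns.mkString(", "))

    spark.stop()
  }
}
```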

1 Comment

Thanks, but here the attributes are packed inside the JSON body, so using "drop" directly won't work.
