
I need to derive two new columns, beaver_id and llama_id, from the schema below. Some if/else logic needs to be applied to an array of structs, and the desired end result is a CSV output. What is the best approach for this?

Schema:

root
 |-- Animal: struct (nullable = true)
 |    |-- Species: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- color: string (nullable = true)
 |    |    |    |-- unique_id: string (nullable = true)

Pseudo Code:

If name == "Beaver" 
   then get unique_id and put in dataframe column "beaver_id"
     else
   null in column "beaver_id"

If name == "Llama"
   then get unique_id and put in dataframe column "llama_id"
     else
   null in column "llama_id"

If array of names does not contain "Llama" or "Beaver"
   then null for both "beaver_id" and "llama_id"

Currently, I am using the DataFrame select function to pick elements out of the input (Parquet) to create the CSV output. I am extracting many other elements with this approach besides the ones in this question.

val select_df = raw_df.select(
  col("Animal.Species.name").as("name"),
  col("Animal.Species.color").as("color")
)

Example input (in JSON; the actual input is Parquet):

{
  "Animal": {
    "Species": [
      {
        "name": "Beaver",
        "color": "red",
        "unique_id": "1001"
      },
      {
        "name": "Llama",
        "color": "blue",
        "unique_id": "2222"
      }
    ]
  }
}

Expected CSV output:

beaver_id, llama_id
1001, 2222

2 Answers


You can use the filter function on the Animal.Species array column like this:

val select_df = raw_df.select(
  element_at(expr("filter(Animal.Species, x -> x.name = 'Beaver')"), 1)
    .getField("unique_id")
    .as("beaver_id"),
  element_at(expr("filter(Animal.Species, x -> x.name = 'Llama')"), 1)
    .getField("unique_id")
    .as("llama_id")
)

select_df.show
//+---------+--------+
//|beaver_id|llama_id|
//+---------+--------+
//|     1001|    2222|
//+---------+--------+

The logic is quite simple: we filter the array to find the inner struct whose name is Beaver or Llama, and take its unique_id. If no match is found, element_at returns null.
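
Since the end goal is a CSV file, the result can then be written out directly; a minimal sketch, assuming a placeholder output path:

// Write the derived columns as CSV with a header row.
// The path is a placeholder; coalesce(1) just forces a single output file.
select_df
  .coalesce(1)
  .write
  .option("header", "true")
  .csv("/tmp/select_df_csv")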

Note that since Spark 3, you can also use the filter function directly in the DataFrame API.
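
For example, the beaver_id column from above could be expressed with the Spark 3 Scala API like this (a sketch of the same logic, without the SQL expression string):

import org.apache.spark.sql.functions.{col, element_at, filter, lit}

// Same filtering logic as above, using the DataFrame API filter
// instead of an expr(...) SQL string.
val beaver_id = element_at(
    filter(col("Animal.Species"), x => x.getField("name") === lit("Beaver")), 1)
  .getField("unique_id")
  .as("beaver_id")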


3 Comments

I think your solution is pretty clean. In Spark 3, using the filter function, would the syntax be the same except no expr function?
@Defcon yes it's the same syntax but in Scala API: filter(col("Animal.Species"), x => x.getField("name") === lit("Beaver"))
Is there a way to filter on an additional field using this same method? For example, if I wanted to know Beavers that are Blue.
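
Regarding the last comment, predicates inside filter can be combined; a hedged sketch for "blue Beavers", reusing the fields from the schema above:

// Match structs on both name and color by combining predicates with &&.
val blueBeavers = filter(
  col("Animal.Species"),
  x => x.getField("name") === lit("Beaver") && x.getField("color") === lit("blue")
)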

When you have to work with a large number of checks, or with complex ones, it is usually best to use a strategy pattern that simplifies the task and scales well when you want to add more checks later.

In that sense, it seems more convenient to create a context that determines what kind of "Species" the record you are verifying belongs to, and then apply the logic you need: in this case, extracting the required value and assigning it to the field determined by the context.

If I understood your question correctly, I think this would be the best approach for this task; let me know if it fits your case and if you have any questions about it.
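
Since this answer stays abstract, here is a minimal sketch of one way to read it, assuming a plain Map as the "context" (the map and its entries are illustrative, not part of any Spark API):

import org.apache.spark.sql.functions.{col, element_at, filter, lit}

// Illustrative "context": which species name maps to which output column.
// Adding another check is a one-line change to this map.
val speciesToColumn = Map(
  "Beaver" -> "beaver_id",
  "Llama"  -> "llama_id"
)

// Build one id column per entry, reusing the array filter from the first
// answer (Spark 3 DataFrame API).
val idColumns = speciesToColumn.map { case (species, outName) =>
  element_at(filter(col("Animal.Species"), x => x.getField("name") === lit(species)), 1)
    .getField("unique_id")
    .as(outName)
}.toSeq

val select_df = raw_df.select(idColumns: _*)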

