
I need to derive two new columns, beaver_id and llama_id, from the schema below. Some if/else logic needs to be applied to an array of structs, and the desired end result is a CSV output. What is the best approach for this?

Schema:

root
 |-- Animal: struct (nullable = true)
 |    |-- Species: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- color: string (nullable = true)
 |    |    |    |-- unique_id: string (nullable = true)

Pseudo Code:

If name == "Beaver" 
   then get unique_id and put in dataframe column "beaver_id"
     else
   null in column "beaver_id"

If name == "Llama"
   then get unique_id and put in dataframe column "llama_id"
     else
   null in column "llama_id"

If array of names does not contain "Llama" or "Beaver"
   then null for both "beaver_id" and "llama_id"

Currently, I am using the DataFrame select function to pick elements out of the input (Parquet) to create the CSV output. I am extracting many other elements with this approach besides the ones in this question.

val select_df = raw_df.select(
  col("Animal.Species.name").as("name"),
  col("Animal.Species.color").as("color")
)

Example input (in JSON; the actual input is Parquet):

{
  "Animal": {
    "Species": [
      {
        "name": "Beaver",
        "color": "red",
        "unique_id": "1001"
      },
      {
        "name": "Llama",
        "color": "blue",
        "unique_id": "2222"
      }
    ]
  }
}

Expected CSV output:

beaver_id, llama_id
1001, 2222

2 Answers


You can use the filter function on the Animal.Species array column like this:

val select_df = raw_df.select(
  element_at(expr("filter(Animal.Species, x -> x.name = 'Beaver')"), 1)
    .getField("unique_id")
    .as("beaver_id"),
  element_at(expr("filter(Animal.Species, x -> x.name = 'Llama')"), 1)
    .getField("unique_id")
    .as("llama_id")
)

select_df.show
//+---------+--------+
//|beaver_id|llama_id|
//+---------+--------+
//|     1001|    2222|
//+---------+--------+

The logic is quite simple: we filter the array to find the inner struct whose name is Beaver or Llama, and take its unique_id. If no match is found, element_at returns null.
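
Since the end goal is a CSV file, the result can then be written out directly; a minimal sketch, assuming a placeholder output path:

// Write the derived columns as CSV with a header row.
// The path is a placeholder; coalesce(1) just forces a single output file.
select_df
  .coalesce(1)
  .write
  .option("header", "true")
  .csv("/tmp/select_df_csv")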

Note that since Spark 3, you can also use the filter function directly in the DataFrame API.
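
For example, the beaver_id column from above could be expressed with the Spark 3 Scala API like this (a sketch of the same logic, without the SQL expression string):

import org.apache.spark.sql.functions.{col, element_at, filter, lit}

// Same filtering logic as above, using the DataFrame API filter
// instead of an expr(...) SQL string.
val beaver_id = element_at(
    filter(col("Animal.Species"), x => x.getField("name") === lit("Beaver")), 1)
  .getField("unique_id")
  .as("beaver_id")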


3 Comments

I think your solution is pretty clean. In Spark 3, using the filter function, would the syntax be the same except no expr function?
@Defcon yes it's the same syntax but in Scala API: filter(col("Animal.Species"), x => x.getField("name") === lit("Beaver"))
Is there a way to filter on an additional field using this same method? For example, if I wanted to know Beavers that are Blue.
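
Regarding the last comment, predicates inside filter can be combined; a hedged sketch for "blue Beavers", reusing the fields from the schema above:

// Match structs on both name and color by combining predicates with &&.
val blueBeavers = filter(
  col("Animal.Species"),
  x => x.getField("name") === lit("Beaver") && x.getField("color") === lit("blue")
)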

When you have to work with a large number of checks, or with complex ones, it is usually best to use a strategy pattern that simplifies the task and scales well when you want to add more checks later.

In that sense, it seems more convenient to create a context that determines what kind of "Species" the record you are verifying belongs to, and then apply the logic you need: in this case, extracting the required value and assigning it to the field determined by the context.

If I understood your question correctly, I think this would be the best approach for this task; let me know if it fits your case and if you have any questions about it.
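
Since this answer stays abstract, here is a minimal sketch of one way to read it, assuming a plain Map as the "context" (the map and its entries are illustrative, not part of any Spark API):

import org.apache.spark.sql.functions.{col, element_at, filter, lit}

// Illustrative "context": which species name maps to which output column.
// Adding another check is a one-line change to this map.
val speciesToColumn = Map(
  "Beaver" -> "beaver_id",
  "Llama"  -> "llama_id"
)

// Build one id column per entry, reusing the array filter from the first
// answer (Spark 3 DataFrame API).
val idColumns = speciesToColumn.map { case (species, outName) =>
  element_at(filter(col("Animal.Species"), x => x.getField("name") === lit(species)), 1)
    .getField("unique_id")
    .as(outName)
}.toSeq

val select_df = raw_df.select(idColumns: _*)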

