1

I have a column that is a list of identifiers (in this case runways). It could be an array or a comma-separated list, in this example I'm converting it to an array. I'm trying to figure out the idiomatic/programmatic way to update a set of columns based on the contents of said array.

Working example that uses anti-pattern:

        val data = Seq("08L,08R,09")
        val df = data.toDF("runways")
          .withColumn("runway_set", split('runways, ","))
          .withColumn("runway_in_use_08L", when(array_contains('runway_set, "08L"), 1).otherwise(0))
          .withColumn("runway_in_use_26R", when(array_contains('runway_set, "26R"), 1).otherwise(0))
          .withColumn("runway_in_use_08R", when(array_contains('runway_set, "08R"), 1).otherwise(0))
          .withColumn("runway_in_use_26L", when(array_contains('runway_set, "26L"), 1).otherwise(0))
          .withColumn("runway_in_use_09", when(array_contains('runway_set, "09"), 1).otherwise(0))
          .withColumn("runway_in_use_27", when(array_contains('runway_set, "27"), 1).otherwise(0))
          .withColumn("runway_in_use_15L", when(array_contains('runway_set, "15L"), 1).otherwise(0))
          .withColumn("runway_in_use_33R", when(array_contains('runway_set, "33R"), 1).otherwise(0))
          .withColumn("runway_in_use_15R", when(array_contains('runway_set, "15R"), 1).otherwise(0))
          .withColumn("runway_in_use_33L", when(array_contains('runway_set, "33L"), 1).otherwise(0))

This produces essentially one-hot encoded values like so:

+----------+--------------+-----------------+-----------------+-----------------+-----------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+
|   runways|    runway_set|runway_in_use_08L|runway_in_use_26R|runway_in_use_08R|runway_in_use_26L|runway_in_use_09|runway_in_use_27|runway_in_use_15L|runway_in_use_33R|runway_in_use_15R|runway_in_use_33L|
+----------+--------------+-----------------+-----------------+-----------------+-----------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+
|08L,08R,09|[08L, 08R, 09]|                1|                0|                1|                0|               1|               0|                0|                0|                0|                0|
+----------+--------------+-----------------+-----------------+-----------------+-----------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+

Feels like I should be able to take a static sequence of all identifiers and perform some programmatic operation to do all of the above in a loop/map/foreach type of expression, but I am not sure how to formulate it.

E.g.:

val all_runways = Seq("08L","26R","08R","26L","09","27","15L","33R","15R","33L")
// iterate through and update each column, e.g. 'runway_in_use_$i'

Any pointers? Thanks in advance.

1 Answer 1

2

Typical use case for fold.

val df = data.toDF("runways")
  .withColumn("runway_set", split('runways, ","))

val df2 = all_runways.foldLeft(df) { (acc, x) =>
  acc.withColumn(s"runway_in_use_$x", when(array_contains('runway_set, x), 1).otherwise(0))
}
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.