Using case when in Spark Scala

Question

I am new with Apache Spark, I am using Scala to work with it.

I have some doubts and one of them is how to use case when for my example. I have to work with distributed systems and what I have to do is classify some instances. To do this I have a DataFrame as you can see here:

+--------------------+------------+
|       group        |   info     |
+--------------------+------------+
|gr=nat, dfn         |   x1betdfn |
|gr=pjc, ntp         |   x2b1trkn |
|gr=ntp, ntt         |   x3b2td3n |
+--------------------+------------+

So through the column called group I have to filter or do a regex to get just the three letters after gr= (would be nat, pjc or ntp) and then write it in a new column.

What is the most efficient way to do this?

This is what I am doing:

val df2 =
  df.withColumn("tgroup", when(col("group").match === "nat", "nat_1_nm")
  .when(col("group").match === "pjc", "pjc_determined")
  .when(col("group").match === "ntp", "ntp_dway")
  .otherwise("Unknown"))

But is not working. I am trying to get this:

+--------------------+------------+----------------+
|       group        |   info     |                |
+--------------------+------------+----------------+
|gr=nat, dfn         |   x1betdfn | nat_1_nm       |
|gr=pjc, ntp         |   x2b1trkn | pjc_determined |
|gr=e2p, ntt         |   x3b2td3n | Unknown        |
|gr=ntp, ntt         |   x3b2td3n | ntp_dway       |
+--------------------+------------+----------------+

What am I doing wrong? Thanks in advance.

sylvinho81 · Accepted Answer · 2021-11-15 12:44:00Z

1

Try to do something like this, using rlike. Just build your own regexp:

val df2 =
      df.withColumn("tgroup",when(col("group").rlike("(?i)^*nat,\\s[a-zA-Z]*$"), "nat_1_nm").
        otherwise(when(col("group").rlike("(?i)^*pjc,\\s[a-zA-Z]*$"), "pjc_determined").
          otherwise(when(col("group").rlike("(?i)^*ntp,\\s[a-zA-Z]*$"), "ntp_dway").otherwise("Unknown"))))

I tried it with an example and I am getting your expected output:

import spark.implicits._
    val df = Seq(
      ("gr=nat, dfn"),
      ("gr=pjc, ntp"),
      ("gr=e2p, ntt"),
      ("gr=ntp, ntt")
    ).toDF("group")

answered Nov 15, 2021 at 12:44

sylvinho81

3891 gold badge5 silver badges16 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

CloudSparkie Over a year ago

Thanks a lot for your help, I can't test it today so tomorrow I will tell you if is working for me. In rlike can you use differents Regex, right?

sylvinho81 Over a year ago

Yep. Take a look at this post mungingdata.com/apache-spark/…. Depending on what you want to do, maybe you could use "contains" instead of "rlike".

sylvinho81 Over a year ago

@CloudSparkie did you try it?

blackbishop · Accepted Answer · 2021-11-15 20:25:22Z

If the values of group column always of the form gr=XXX, using startsWith method should be enough in your case :

val df2 = df.withColumn(
  "tgroup",
  when(col("group").startsWith("gr=nat"), "nat_1_nm")
    .when(col("group").startsWith("gr=pjc"), "pjc_determined")
    .when(col("group").startsWith("gr=ntp"), "ntp_dway")
    .otherwise("Unknown")
)

df2.show
//+-----------+--------+--------------+
//|      group|    info|        tgroup|
//+-----------+--------+--------------+
//|gr=nat, dfn|x1betdfn|      nat_1_nm|
//|gr=pjc, ntp|x2b1trkn|pjc_determined|
//|gr=e2p, ntt|x3b2td3n|       Unknown|
//|gr=ntp, ntt|x3b2td3n|      ntp_dway|
//+-----------+--------+--------------+

You may also want to extract the value XXX after gr= using regexp_extract function:

val df2 = df.withColumn(
    "tgroup",
    regexp_extract(col("group"), "^gr=(.{3}),.*", 1)
).withColumn(
    "tgroup",
    when(col("tgroup") === "nat", "nat_1_nm")
      .when(col("tgroup") === "pjc", "pjc_determined")
      .when(col("tgroup") === "ntp", "ntp_dway")
      .otherwise("Unknown")
)

Collectives™ on Stack Overflow

Using case when in Spark Scala

2 Answers 2

3 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related