I am new to Apache Spark and I am using it with Scala.

I have some doubts, and one of them is how to use a case when expression in my example. I am working with distributed systems and need to classify some instances. To do this I have a DataFrame, as you can see here:

+--------------------+------------+
|       group        |   info     |
+--------------------+------------+
|gr=nat, dfn         |   x1betdfn |
|gr=pjc, ntp         |   x2b1trkn |
|gr=ntp, ntt         |   x3b2td3n |
+--------------------+------------+

So, using the column called group, I have to filter or apply a regex to get just the three letters after gr= (nat, pjc or ntp) and then write the result in a new column.

What is the most efficient way to do this?

This is what I am doing:

val df2 =
  df.withColumn("tgroup", when(col("group").match === "nat", "nat_1_nm")
  .when(col("group").match === "pjc", "pjc_determined")
  .when(col("group").match === "ntp", "ntp_dway")
  .otherwise("Unknown"))

But it is not working. I am trying to get this:

+--------------------+------------+----------------+
|       group        |   info     | tgroup         |
+--------------------+------------+----------------+
|gr=nat, dfn         |   x1betdfn | nat_1_nm       |
|gr=pjc, ntp         |   x2b1trkn | pjc_determined |
|gr=e2p, ntt         |   x3b2td3n | Unknown        |
|gr=ntp, ntt         |   x3b2td3n | ntp_dway       |
+--------------------+------------+----------------+

What am I doing wrong? Thanks in advance.

2 Answers

Try to do something like this, using rlike. Just build your own regexp:

import org.apache.spark.sql.functions.{col, when}

val df2 = df.withColumn(
  "tgroup",
  when(col("group").rlike("(?i)^gr=nat,\\s[a-zA-Z]*$"), "nat_1_nm")
    .when(col("group").rlike("(?i)^gr=pjc,\\s[a-zA-Z]*$"), "pjc_determined")
    .when(col("group").rlike("(?i)^gr=ntp,\\s[a-zA-Z]*$"), "ntp_dway")
    .otherwise("Unknown"))

I tried it with an example and I am getting your expected output:

import spark.implicits._

val df = Seq(
  "gr=nat, dfn",
  "gr=pjc, ntp",
  "gr=e2p, ntt",
  "gr=ntp, ntt"
).toDF("group")
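
For anyone who wants to reproduce this end to end, here is a minimal sketch that also includes the info column. It assumes a local SparkSession; the object name TGroupExample, the helper withTGroup and the app name are made up for illustration. The patterns are shortened and anchored on the gr= prefix, which is an assumption about how the data always looks.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object TGroupExample {
  // Hypothetical helper: adds the tgroup column with a chained when/otherwise.
  // rlike does a partial (find-style) match, so the pattern is anchored with ^.
  def withTGroup(df: DataFrame): DataFrame =
    df.withColumn(
      "tgroup",
      when(col("group").rlike("^gr=nat,"), "nat_1_nm")
        .when(col("group").rlike("^gr=pjc,"), "pjc_determined")
        .when(col("group").rlike("^gr=ntp,"), "ntp_dway")
        .otherwise("Unknown"))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("tgroup-example")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the expected output in the question.
    val df = Seq(
      ("gr=nat, dfn", "x1betdfn"),
      ("gr=pjc, ntp", "x2b1trkn"),
      ("gr=e2p, ntt", "x3b2td3n"),
      ("gr=ntp, ntt", "x3b2td3n")
    ).toDF("group", "info")

    withTGroup(df).show(false)
    spark.stop()
  }
}
```

Chaining .when on the result of when(...) is equivalent to nesting when inside otherwise, but it stays flat as more cases are added.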

3 Comments

Thanks a lot for your help. I can't test it today, so tomorrow I will tell you whether it works for me. You can use different regexes in rlike, right?
Yep. Take a look at this post mungingdata.com/apache-spark/…. Depending on what you want to do, maybe you could use "contains" instead of "rlike".
@CloudSparkie did you try it?

If the values of the group column are always of the form gr=XXX, the startsWith method should be enough in your case:

val df2 = df.withColumn(
  "tgroup",
  when(col("group").startsWith("gr=nat"), "nat_1_nm")
    .when(col("group").startsWith("gr=pjc"), "pjc_determined")
    .when(col("group").startsWith("gr=ntp"), "ntp_dway")
    .otherwise("Unknown")
)

df2.show
//+-----------+--------+--------------+
//|      group|    info|        tgroup|
//+-----------+--------+--------------+
//|gr=nat, dfn|x1betdfn|      nat_1_nm|
//|gr=pjc, ntp|x2b1trkn|pjc_determined|
//|gr=e2p, ntt|x3b2td3n|       Unknown|
//|gr=ntp, ntt|x3b2td3n|      ntp_dway|
//+-----------+--------+--------------+

You may also want to extract the value XXX after gr= using the regexp_extract function:

val df2 = df.withColumn(
    "tgroup",
    regexp_extract(col("group"), "^gr=(.{3}),.*", 1)
).withColumn(
    "tgroup",
    when(col("tgroup") === "nat", "nat_1_nm")
      .when(col("tgroup") === "pjc", "pjc_determined")
      .when(col("tgroup") === "ntp", "ntp_dway")
      .otherwise("Unknown")
)
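
If you want to check what the capture group yields before running this on a cluster, the same pattern can be tried with plain Scala regexes. This is only a sketch, not Spark code: extractGroup and label are hypothetical helper names, and the empty-string fallback imitates regexp_extract, which returns an empty string when the pattern does not match.

```scala
import scala.util.matching.Regex

object ExtractGroupSketch {
  // Same pattern as the one passed to regexp_extract.
  val GroupPattern: Regex = "^gr=(.{3}),.*".r

  // Hypothetical helper: returns capture group 1, or "" when nothing matches,
  // imitating what regexp_extract does for non-matching rows.
  def extractGroup(s: String): String =
    GroupPattern.findFirstMatchIn(s).map(_.group(1)).getOrElse("")

  // Hypothetical helper mirroring the when/otherwise chain.
  def label(code: String): String = code match {
    case "nat" => "nat_1_nm"
    case "pjc" => "pjc_determined"
    case "ntp" => "ntp_dway"
    case _     => "Unknown"
  }

  def main(args: Array[String]): Unit = {
    Seq("gr=nat, dfn", "gr=pjc, ntp", "gr=e2p, ntt", "gr=ntp, ntt")
      .foreach(g => println(s"$g -> ${label(extractGroup(g))}"))
  }
}
```

Rows that do not start with gr= end up as an empty string after extraction, so they fall through to "Unknown" just like an unrecognised code such as e2p.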
