I am new to Apache Spark and I am using it with Scala.
One thing I am unsure about is how to use a case-when expression in my example. I am working with distributed systems and I need to classify some instances. To do this I have a DataFrame like the following:
+--------------------+------------+
| group              | info       |
+--------------------+------------+
| gr=nat, dfn        | x1betdfn   |
| gr=pjc, ntp        | x2b1trkn   |
| gr=ntp, ntt        | x3b2td3n   |
+--------------------+------------+
Using the column called group, I need to filter or apply a regex to get just the three letters after gr= (nat, pjc or ntp) and then write a label based on them to a new column.
What is the most efficient way to do this?
This is what I am doing:
val df2 =
  df.withColumn("tgroup", when(col("group").match === "nat", "nat_1_nm")
    .when(col("group").match === "pjc", "pjc_determined")
    .when(col("group").match === "ntp", "ntp_dway")
    .otherwise("Unknown"))
But it is not working. I am trying to get this:
+--------------------+------------+----------------+
| group              | info       | tgroup         |
+--------------------+------------+----------------+
| gr=nat, dfn        | x1betdfn   | nat_1_nm       |
| gr=pjc, ntp        | x2b1trkn   | pjc_determined |
| gr=e2p, ntt        | x3b2td3n   | Unknown        |
| gr=ntp, ntt        | x3b2td3n   | ntp_dway       |
+--------------------+------------+----------------+
What am I doing wrong? Thanks in advance.
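
For reference, here is a minimal sketch of one way the classification might be written. It is not necessarily the most efficient approach. Two things stand out in the snippet above: match is a reserved word in Scala and Column has no such method, and === tests the whole value for equality, so comparing the full group string with "nat" would never be true. The sketch assumes regexp_extract from org.apache.spark.sql.functions is used to pull out the three characters after gr=, and that the session runs locally. The object name TgroupSketch, the code value, the app name and the pattern gr=(\w{3}) are illustrative choices, not something taken from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract, when}

object TgroupSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would use the cluster's session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("tgroup-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val df = Seq(
      ("gr=nat, dfn", "x1betdfn"),
      ("gr=pjc, ntp", "x2b1trkn"),
      ("gr=ntp, ntt", "x3b2td3n")
    ).toDF("group", "info")

    // Capture the three word characters that follow "gr=".
    // regexp_extract returns an empty string when the pattern does not match,
    // so such rows fall through to the otherwise branch.
    val code = regexp_extract(col("group"), "gr=(\\w{3})", 1)

    val df2 = df.withColumn("tgroup",
      when(code === "nat", "nat_1_nm")
        .when(code === "pjc", "pjc_determined")
        .when(code === "ntp", "ntp_dway")
        .otherwise("Unknown"))

    df2.show(false)
    spark.stop()
  }
}
```

If the extracted code is not needed on its own, a condition such as col("group").contains("gr=nat") inside each when might be enough. The regex version keeps the extraction in one place, which can be easier to adjust if the prefix format changes.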