1

I am relatively new to Spark and I am trying to filter out invalid records from a Spark Dataset. My dataset looks something like this:

| Id | Curr| Col3 |

| 1  | USD | 1111 |
| 2  | CNY | 2222 |
| 3  | USD | 3333 |
| 1  | CNY | 4444 |

In my logic, each Id has a vaild currency. So it will basically be a map of id->currency

val map = Map(1 -> "USD", 2 -> "CNY")

I want to filter out the rows from the dataset that have Id not corresponding to the valid currency code. So after my filter operation, the dataset should look something like this:

| Id | Curr| Col3 |

| 1  | USD | 1111 |
| 2  | CNY | 2222 |

The limitation I have here is that I cannot use a UDF. Can somebody help me in coming up with a filter operation for this?

2 Answers 2

3

You can create a data frame out of the map and then do an inner join with the original data frame to filter it:

val map_df = map.toSeq.toDF("Id", "Curr")
// map_df: org.apache.spark.sql.DataFrame = [Id: int, Curr: string]

df.join(map_df, Seq("Id", "Curr")).show
+---+----+----+
| Id|Curr|Col3|
+---+----+----+
|  1| USD|1111|
|  2| CNY|2222|
+---+----+----+
Sign up to request clarification or add additional context in comments.

3 Comments

Maybe my question wasn't clear enough. The dataset might have a row with a valid Id but an invalid currency code. Like (1, CNY, 333). In this case I want to remove such entries as well. I will update my question to reflect this case.
Do you want Id and currency both match at the same time?
Yes, I basically want to keep only the rows that have valid Id and Currency information. Any row that has mismatching Id and Currency column should be removed.
-2
val a = List((1,"USD",1111),(2,"CAN",2222),(3,"USD",4444),(1,"CAN",5555))
val b = Map(1 -> "USD",2 -> "CAN")
a.filter(x => b.keys.exists(_ == x._1)).filter(y => y._2 == b(y._1))

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.