I'm trying to filter df1 by joining it with df2 on a column and then removing some rows from df1 based on a condition.
df1:
+---------------+----------+
|        channel|rag_status|
+---------------+----------+
|            STS|     green|
|Rapid Cash Plus|     green|
|        DOTOPAL|     green|
|     RAPID CASH|     green|
+---------------+----------+
df2:
+---------------+----------+
|        channel|rag_status|
+---------------+----------+
|            STS|      blue|
|Rapid Cash Plus|      blue|
|        DOTOPAL|      blue|
+---------------+----------+
Sample code is:
df1.join(df2, df1.col("channel") === df2.col("channel"), "leftouter")
.filter(not(df1.col("rag_status") === "green"))
.select(df1.col("channel"), df1.col("rag_status")).show
It's not returning any records.
I'm expecting the output below, which is df1 after filtering on the channel and green-status condition: if a channel is also present in df2 and its rag_status in df1 is green, remove that record from df1 and return only the remaining records from df1.
Expected output is:
+---------------+----------+
|        channel|rag_status|
+---------------+----------+
|     RAPID CASH|     green|
+---------------+----------+
There is no rag_status other than green in your first dataframe, so the filter filter(not(df1.col("rag_status") === "green")) works as expected: you ask for the rows where rag_status is NOT green in df1, and there are none.
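
One way to get the expected output is a left anti join, which keeps only the df1 rows that have no match in df2. The sketch below is a minimal example, not the only approach. It assumes a local SparkSession and rebuilds the two sample dataframes from the question; the object name AntiJoinSketch and the app name are made up for illustration. The join condition matches on channel and also requires the df1 rag_status to be green, so those are the rows that get removed.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, only here to make the sketch runnable.
object AntiJoinSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("anti-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data copied from the question.
    val df1 = Seq(
      ("STS", "green"),
      ("Rapid Cash Plus", "green"),
      ("DOTOPAL", "green"),
      ("RAPID CASH", "green")
    ).toDF("channel", "rag_status")

    val df2 = Seq(
      ("STS", "blue"),
      ("Rapid Cash Plus", "blue"),
      ("DOTOPAL", "blue")
    ).toDF("channel", "rag_status")

    // left_anti keeps df1 rows for which NO df2 row satisfies the condition,
    // i.e. it drops df1 rows that are green and whose channel exists in df2.
    val result = df1.join(
      df2,
      df1("channel") === df2("channel") && df1("rag_status") === "green",
      "left_anti"
    )

    // The result contains only df1's columns: channel, rag_status.
    result.show()

    spark.stop()
  }
}
```

With the sample data from the question this should leave only the RAPID CASH row. If you would rather keep the leftouter join from the question, the equivalent filter is to keep rows where df2.col("channel") is null or the df1 rag_status is not green, instead of filtering on rag_status alone.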