Scala - Drop records from DF1 if it has matching data with column from DF2 [duplicate]

Question

I have two DataFrames (railroadGreaterFile, railroadInputFile).

I want to drop records from railroadGreaterFile when the value in its MEMBER_NUM column matches a value in the MEMBER_NUM column of railroadInputFile.

Below is what I used:

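import org.apache.spark.sql.functions.lit
// $"..." also needs: import spark.implicits._ (spark being the active SparkSession)

// Tag every input row so that matched rows can be recognised after the join
val columnrailroadInputFile = railroadInputFile.withColumn("check", lit("check"))
val railroadGreaterNotInput = railroadGreaterFile
                               .join(columnrailroadInputFile, Seq("MEMBER_NUM"), "left")
                               .filter($"check".isNull) // keep only rows with no match
                               .drop($"check")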

This drops the records, but the schema of railroadGreaterNotInput is a combination of DF1 and DF2, so when I try to write railroadGreaterNotInput to a file I get the error below:

org.apache.spark.sql.AnalysisException: Reference 'GROUP_NUM' is ambiguous, could be: GROUP_NUM#508, GROUP_NUM#72

What should I do so that railroadGreaterNotInput contains only the fields from the railroadGreaterFile DataFrame?

You can rename the conflicting columns from railroadInputFile and then select only the railroadGreaterFile columns after you join them.
– Anahcolus, Commented May 3, 2018 at 10:57
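
One way to read this comment, sketched here with DataFrame aliases rather than by renaming each conflicting column one at a time: give each side an alias, join on the qualified key, and select only the left alias's columns. The object name, the helper keepUnmatched, the aliases "g" and "i", and the sample rows are all hypothetical; only the column names come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SelectLeftColumnsSketch {
  // Keep rows of `left` whose `key` has no match in `right`,
  // returning only the columns that belong to `left`.
  def keepUnmatched(left: DataFrame, right: DataFrame, key: String): DataFrame =
    left.as("g")
      .join(right.as("i"), col(s"g.$key") === col(s"i.$key"), "left")
      .filter(col(s"i.$key").isNull) // unmatched rows have a null right-hand key
      .select("g.*")                 // discard every column that came from `right`

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data using the column names from the question
    val greater = Seq(("G1", "M1"), ("G1", "M2"), ("G2", "M3")).toDF("GROUP_NUM", "MEMBER_NUM")
    val input   = Seq(("G9", "M2")).toDF("GROUP_NUM", "MEMBER_NUM")

    keepUnmatched(greater, input, "MEMBER_NUM").show()
    spark.stop()
  }
}
```

Because the final select takes only the "g" side, the result has a single GROUP_NUM column and the ambiguity error should not occur when writing it out.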

koiralo · Accepted Answer · 2018-05-03 11:06:22Z

2

You can select only MEMBER_NUM (plus the check marker column) from the second DataFrame when joining:

val columnrailroadInputFile = railroadInputFile.withColumn("check", lit("check"))
val railroadGreaterNotInput = railroadGreaterFile
    .join(columnrailroadInputFile.select("MEMBER_NUM", "check"), Seq("MEMBER_NUM"), "left")
    .filter($"check".isNull)
    .drop($"check")

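Or, after the join and the filter, drop every column that came from columnrailroadInputFile. Drop them by column reference rather than by name, because dropping by name would also remove the same-named railroadGreaterFile columns (here joined stands for the joined and filtered DataFrame):

columnrailroadInputFile.columns.foldLeft(joined)((df, c) => df.drop(columnrailroadInputFile(c)))

but for this, use the join condition

railroadGreaterFile("MEMBER_NUM") === columnrailroadInputFile("MEMBER_NUM")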

Hope this helps!
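
As an alternative to the marker column (this is a different technique from the one in the answer above), a left anti join expresses "rows of DF1 with no match in DF2" directly. A minimal sketch, assuming Spark 2.0 or later, where "left_anti" is an accepted join type; the object name, the helper dropMatching, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LeftAntiSketch {
  // Rows of `left` whose `key` value does not appear in `right`.
  // An anti join returns only the left side's columns, and with an explicit
  // join condition they keep the left side's original order.
  def dropMatching(left: DataFrame, right: DataFrame, key: String): DataFrame =
    left.join(right, left(key) === right(key), "left_anti")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data using the column names from the question
    val railroadGreaterFile = Seq(("G1", "M1"), ("G1", "M2"), ("G2", "M3"))
      .toDF("GROUP_NUM", "MEMBER_NUM")
    val railroadInputFile = Seq(("G9", "M2")).toDF("GROUP_NUM", "MEMBER_NUM")

    val railroadGreaterNotInput =
      dropMatching(railroadGreaterFile, railroadInputFile, "MEMBER_NUM")
    railroadGreaterNotInput.printSchema() // only GROUP_NUM and MEMBER_NUM from the left side
    spark.stop()
  }
}
```

The sketch assumes the two DataFrames come from different sources. If both were derived from the same DataFrame, left(key) === right(key) could be ambiguous, and aliasing each side would be the safer choice.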

edited May 3, 2018 at 11:06

answered May 3, 2018 at 10:58

7 Comments

Kiran Kumar Over a year ago

I'm almost there. Now MEMBER_NUM is the first column, followed by the rest of the columns. Is there a way I can swap the first and second columns in railroadGreaterNotInput?

koiralo Over a year ago

I am not sure what you mean by swapping the first and second columns.

Kiran Kumar Over a year ago

The schema of railroadInputFile is GROUP_NUM, MEMBER_NUM, .... The schema of railroadGreaterFile is GROUP_NUM, MEMBER_NUM, .... The schema of the final DF, railroadGreaterNotInput, shows MEMBER_NUM, GROUP_NUM, .... I want the final DF to have the same column order as DF1 and DF2.

koiralo Over a year ago

Do you want to swap GROUP_NUM and MEMBER_NUM, and why do you want to do that?

koiralo Over a year ago

You just need to use select("fields in the order you want")
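
The select suggestion can be sketched as follows: reuse the column order of the original DataFrame instead of typing the field names by hand. The object name, the helper reorderLike, and the sample rows are hypothetical, and the sketch assumes the column names contain no dots or other characters that would need backtick quoting.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ReorderColumnsSketch {
  // Select the columns of `df` in the order in which they appear in `template`.
  // Every column of `template` must exist in `df`.
  def reorderLike(df: DataFrame, template: DataFrame): DataFrame =
    df.select(template.columns.map(col): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical data: the join result has MEMBER_NUM first, the source has GROUP_NUM first
    val railroadGreaterFile = Seq(("G1", "M1")).toDF("GROUP_NUM", "MEMBER_NUM")
    val joinedResult        = Seq(("M1", "G1")).toDF("MEMBER_NUM", "GROUP_NUM")

    val reordered = reorderLike(joinedResult, railroadGreaterFile)
    println(reordered.columns.mkString(", ")) // prints "GROUP_NUM, MEMBER_NUM"
    spark.stop()
  }
}
```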