I have two DataFrames (railroadGreaterFile and railroadInputFile).

I want to drop records from railroadGreaterFile when the value in its MEMBER_NUM column matches a value in the MEMBER_NUM column of railroadInputFile.

Below is what I used:

val columnrailroadInputFile = railroadInputFile.withColumn("check", lit("check"))
val railroadGreaterNotInput = railroadGreaterFile
                               .join(columnrailroadInputFile, Seq("MEMBER_NUM"), "left")
                               .filter($"check".isNull)
                               .drop($"check")

With the above, the matching records are dropped. However, I noticed that the schema of railroadGreaterNotInput is a combination of my DF1 and DF2, so when I try to write railroadGreaterNotInput's data to a file, I get the error below:

org.apache.spark.sql.AnalysisException: Reference 'GROUP_NUM' is ambiguous, could be: GROUP_NUM#508, GROUP_NUM#72

What should I do so that railroadGreaterNotInput contains only the fields from the railroadGreaterFile DF?

  • You can rename the conflicting columns in railroadInputFile and then select only the railroadGreaterFile columns after you join them. Commented May 3, 2018 at 10:57

1 Answer

You can select only MEMBER_NUM (plus the check marker column) from columnrailroadInputFile before joining, so no other column names can clash:

val columnrailroadInputFile = railroadInputFile.withColumn("check", lit("check"))
val railroadGreaterNotInput = railroadGreaterFile.join(
    columnrailroadInputFile.select("MEMBER_NUM", "check"), Seq("MEMBER_NUM"), "left")
   .filter($"check".isNull).drop($"check")

Or drop all the columns from columnrailroadInputFile as

columnrailroadInputFile.drop(columnrailroadInputFile.columns :_*)

but for this, use an explicit join condition between the two sides instead of Seq("MEMBER_NUM"):

railroadGreaterFile("MEMBER_NUM") === columnrailroadInputFile("MEMBER_NUM")
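As an alternative to the marker-column approach, a left anti join keeps only the left-side rows that have no match on the right, and it returns only the left DataFrame's columns. The sketch below assumes Spark 2.x or later and a local SparkSession. The object name, the helper dropMatchingMembers, and the sample rows are hypothetical; only the DataFrame and column names come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LeftAntiSketch {
  // Keep rows of `greater` whose MEMBER_NUM has no match in `input`.
  // An explicit join condition (rather than Seq("MEMBER_NUM")) is used so the
  // result keeps the column order of `greater`.
  def dropMatchingMembers(greater: DataFrame, input: DataFrame): DataFrame =
    greater.join(input, greater("MEMBER_NUM") === input("MEMBER_NUM"), "left_anti")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("left-anti-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the real files.
    val railroadGreaterFile =
      Seq(("G1", "M1"), ("G1", "M2"), ("G2", "M3")).toDF("GROUP_NUM", "MEMBER_NUM")
    val railroadInputFile =
      Seq(("G9", "M2")).toDF("GROUP_NUM", "MEMBER_NUM")

    val railroadGreaterNotInput = dropMatchingMembers(railroadGreaterFile, railroadInputFile)

    railroadGreaterNotInput.printSchema() // only GROUP_NUM and MEMBER_NUM from the left side
    railroadGreaterNotInput.show()

    spark.stop()
  }
}
```

Because no right-side columns reach the result, there is nothing left to be ambiguous when the DataFrame is written out, and no check column has to be added or dropped.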

Hope this helps!


7 Comments

I'm almost there. Now MEMBER_NUM is the first column, followed by the rest of the columns. Is there a way to swap the first and second columns in railroadGreaterNotInput?
I am not sure what you mean by swapping the first and second columns.
The schema of railroadInputFile is GROUP_NUM, MEMBER_NUM, .... The schema of railroadGreaterFile is GROUP_NUM, MEMBER_NUM, .... The final DF railroadGreaterNotInput's schema shows MEMBER_NUM, GROUP_NUM, .... I want the final DF to be in sync with my DF1 and DF2.
Do you want to swap GROUP_NUM and MEMBER_NUM, and why do you want to do that?
You just need to use select("fields in the order you want")
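The select suggestion above can be sketched as follows: take the column order from the DataFrame whose layout you want and pass it to select. This is a minimal sketch, assuming the result has exactly one column per name (no duplicates left over from the join). The object name, the helper reorderLike, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ColumnOrderSketch {
  // Select the columns of `df` in the order they appear in `reference`.
  // Assumes every column name of `reference` exists exactly once in `df`.
  def reorderLike(df: DataFrame, reference: DataFrame): DataFrame =
    df.select(reference.columns.map(col): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("column-order-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: the join result has MEMBER_NUM first,
    // while the source DataFrame has GROUP_NUM first.
    val railroadGreaterFile = Seq(("G1", "M1")).toDF("GROUP_NUM", "MEMBER_NUM")
    val railroadGreaterNotInput = Seq(("M1", "G1")).toDF("MEMBER_NUM", "GROUP_NUM")

    val reordered = reorderLike(railroadGreaterNotInput, railroadGreaterFile)
    reordered.printSchema() // GROUP_NUM, then MEMBER_NUM

    spark.stop()
  }
}
```

Taking the order from railroadGreaterFile.columns avoids hard-coding the field list, so the result stays in sync if the source schema gains columns.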