
I am developing a dynamic script that can join any given PySpark dataframes. The problem is that the column names in the file will vary, and the number of join conditions may vary. I can handle this in a loop, but when I execute the join with variable names it fails.

(My intention is to dynamically populate a and b, or more columns, based on the file structure and the join conditions.)

b="incrementalFile.Id1"
a="existingFile.Id"
unChangedRecords = existingFile.join(incrementalFile,(a==b),"left") 

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 818, in join
    assert isinstance(on[0], Column), "on should be Column or list of Column"
AssertionError: on should be Column or list of Column
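(For reference, the reason the assertion fires: with plain strings on both sides, `==` is ordinary Python string comparison and yields a bool, not a PySpark Column, so `join` receives `False` as the condition. A minimal illustration, no Spark needed:)

```python
# The join condition is evaluated eagerly in Python before join() ever
# sees it; comparing two strings just produces a plain bool.
a = "existingFile.Id"
b = "incrementalFile.Id1"
condition = (a == b)
print(type(condition).__name__)  # bool, not a pyspark Column
print(condition)                 # False
```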

But the same code works fine if I don't use any variables in the join condition, as below.

unChangedRecords = existingFile.join(
    incrementalFile,
    (existingFile.Id==incrementalFile.Id1), 
    "left")
Comments:
  • Why is this tagged 'scala'? Commented Feb 24, 2018 at 0:46
  • @DyZ: the reason is, the logic can be the same in Scala or PySpark. Commented Feb 24, 2018 at 22:18

1 Answer

In your second example, existingFile.Id is a Column, but in your first example it's a string. You want to use pyspark.sql.functions.col to reference a column by name. Its docs don't include an example, but it is used in the example for alias on the same page.


1 Comment

You are great! That's a good catch. Let me try and accept the answer.. +1 for now
