
I am developing a dynamic script that can join any given PySpark dataframes. The problem is that the column names in the file will vary, and the number of join conditions may vary. I can handle this in a loop, but when I execute the join with variable names it fails.

(My intention is to dynamically populate a and b, or more columns, based on the file structure and the join conditions.)

b="incrementalFile.Id1"
a="existingFile.Id"
unChangedRecords = existingFile.join(incrementalFile,(a==b),"left") 

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 818, in join
    assert isinstance(on[0], Column), "on should be Column or list of Column"
AssertionError: on should be Column or list of Column
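(For reference, the reason the assertion fires: with plain strings on both sides, `==` is ordinary Python string comparison and yields a bool, not a PySpark Column, so `join` receives `False` as the condition. A minimal illustration, no Spark needed:)

```python
# The join condition is evaluated eagerly in Python before join() ever
# sees it; comparing two strings just produces a plain bool.
a = "existingFile.Id"
b = "incrementalFile.Id1"
condition = (a == b)
print(type(condition).__name__)  # bool, not a pyspark Column
print(condition)                 # False
```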

But the same code works fine if I don't use any variables in the join condition, as below.

unChangedRecords = existingFile.join(
    incrementalFile,
    (existingFile.Id==incrementalFile.Id1), 
    "left")
Comments:
  • Why is this tagged 'scala'? Commented Feb 24, 2018 at 0:46
  • @DyZ: the reason is, the logic can be the same in Scala or PySpark. Commented Feb 24, 2018 at 22:18

1 Answer

In your second example, existingFile.Id is a Column, but in your first example it's a string. You want to use pyspark.sql.functions.col to reference a column by name. Its docs don't include an example, but it is used in the example for alias on the same page.


1 Comment

You are great! That's a good catch. Let me try and accept the answer.. +1 for now
