I want to perform a left outer join on a Dataset using the Spark Java API. How can I write a dynamic join condition that matches multiple columns?

I have two Dataset objects, each with two or more columns, and I am not able to define the condition.

Example that matches one column against another:

dataSet = resultData.as("resultData").join(distinctData.as("distinctData"), resultData.col("A").equalTo(distinctData.col("B")), "leftouter").selectExpr(select.toString());

Since there are multiple columns, I am not able to define a dynamic expression for matching them using the Java API.

  • You probably got a downvote because you haven't included anything about what your data looks like or what you've tried so far. I'd be happy to help if you could provide that information. Commented Apr 23, 2019 at 13:25
  • Edited the question. Commented Apr 23, 2019 at 13:33
  • Do you get an error? What happens when you run the code above? Commented Apr 23, 2019 at 14:05
  • For the example in the question I don't get any error. The issue is that I want to specify a condition that matches multiple columns, and I can't find any reference on how to define it. Commented Apr 23, 2019 at 14:12

1 Answer

Untested code, but this dynamically generates a join condition from a list of column names. It assumes each column has the same name in both datasets:

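    import java.util.List;
    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Recursively builds df1.c1 = df2.c1 AND df1.c2 = df2.c2 ... from a list of column names.
    // Pass null as c on the first call. The caller's list is not modified.
    public Column makeJoinConditional(Dataset<Row> df1, Dataset<Row> df2, List<String> columnNames, Column c) {
        if (columnNames.isEmpty()) {
            if (c == null) {
                throw new IllegalArgumentException("At least one join column is required");
            }
            return c;
        }
        String top = columnNames.get(0);
        Column match = df1.col(top).equalTo(df2.col(top));
        Column next = (c == null) ? match : c.and(match);
        // subList gives a view of the remaining names instead of removing from the input,
        // so immutable lists (e.g. from Arrays.asList) also work
        return makeJoinConditional(df1, df2, columnNames.subList(1, columnNames.size()), next);
    }

    public Dataset<Row> joinDataFrames(Dataset<Row> df1, Dataset<Row> df2, List<String> columns) {
        Column joinCols = makeJoinConditional(df1, df2, columns, null);
        // Inner join by default; use df1.join(df2, joinCols, "leftouter") for a left outer join
        return df1.join(df2, joinCols);
    }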
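
The function above matches columns that share a name in both datasets, while the question's example compares differently named columns (A against B). For that case, one possible approach is to build the condition as a SQL string from a mapping of left column to right column and hand it to Spark. The sketch below is also untested against Spark: the class name JoinConditionBuilder, the method build, and the column names C and D are made up for illustration, and the Spark usage shown in the comment assumes both datasets are aliased as in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of Spark): turns column pairs into a SQL join condition string.
class JoinConditionBuilder {

    // Returns e.g. "resultData.A = distinctData.B AND resultData.C = distinctData.D".
    // Use a LinkedHashMap if the order of the conditions matters to you.
    static String build(String leftAlias, String rightAlias, Map<String, String> columnPairs) {
        if (columnPairs.isEmpty()) {
            throw new IllegalArgumentException("At least one column pair is required");
        }
        StringJoiner condition = new StringJoiner(" AND ");
        for (Map.Entry<String, String> pair : columnPairs.entrySet()) {
            condition.add(leftAlias + "." + pair.getKey() + " = " + rightAlias + "." + pair.getValue());
        }
        return condition.toString();
    }

    // Assumed usage with Spark (needs org.apache.spark.sql.functions on the classpath):
    //   String cond = JoinConditionBuilder.build("resultData", "distinctData", pairs);
    //   resultData.as("resultData")
    //       .join(distinctData.as("distinctData"), functions.expr(cond), "leftouter");
}
```

Because the condition is plain text, the number of column pairs can vary per scenario without changing the code. Column names that contain spaces or dots would need backtick quoting, which this sketch does not handle.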
4 Comments

Yes, this will work when I have a fixed number of columns. But the number of columns varies with each scenario, and I can't change the code every time for this :)
So you need a function that, given a list of columns, can generate the conditional statement?
Yes, something like what you said. But dataset.join() accepts a column, columnExpr, and seq. I am trying to find out what columnExpr is and how it can return the conditional column/statement.
OK, updated the answer to dynamically generate the conditional based on a list of column names.
