I have two dataframes like below, and I need to merge them so that the result keeps only the IDs that appear in both, with one row per status.
Dataframe 1
| ID | status |
|---|---|
| V1 | Low |
| V2 | Low |
| V3 | Low |
Dataframe 2
| ID | status |
|---|---|
| V1 | High |
| V2 | High |
| V6 | High |
Expected dataframe:
| ID | status |
|---|---|
| V1 | Low |
| V1 | High |
| V2 | Low |
| V2 | High |
(I only know Java, not Scala, sorry)
I would say, if you call your dataset 1 `A` and your dataset 2 `B`:

```java
Column joinClause = A.col("ID").equalTo(B.col("ID"));

// Rows of A whose ID also appears in B, stacked with
// rows of B whose ID also appears in A.
Dataset<Row> A_with_B = A.join(B, joinClause, "left_semi")
        .union(
                B.join(A, joinClause, "left_semi")
        );
```
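To see why the pair of semi-joins yields exactly the expected rows, here is the same logic sketched on plain Scala collections (no Spark needed; the object and method names are illustrative, not part of any API):

```scala
object SemiJoinSketch {
  // Each dataframe modeled as a list of (ID, status) rows.
  val a = List(("V1", "Low"), ("V2", "Low"), ("V3", "Low"))
  val b = List(("V1", "High"), ("V2", "High"), ("V6", "High"))

  // A left-semi join keeps rows of the left side whose key exists on the right;
  // no columns from the right side are carried over.
  def leftSemi(left: List[(String, String)],
               right: List[(String, String)]): List[(String, String)] = {
    val rightKeys = right.map(_._1).toSet
    left.filter { case (id, _) => rightKeys.contains(id) }
  }

  // Union of the two semi-joins: A's matched rows, then B's matched rows.
  val merged = leftSemi(a, b) ++ leftSemi(b, a)
}
```

The unmatched rows (`V3` and `V6`) drop out of both semi-joins, so only the shared IDs survive with both of their statuses.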
One option is to do an inner join and then take a union of the two resulting status columns:
```scala
import org.apache.spark.sql.SparkSession

object dev extends App {
  val spark = SparkSession.builder()
    .appName("Join and Stack Example")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._ // for toDF and the $ column syntax

  val df1 = Seq(
    ("V1", "Low"),
    ("V2", "Low"),
    ("V3", "Low")
  ).toDF("ID", "Status")

  val df2 = Seq(
    ("V1", "High"),
    ("V2", "High"),
    ("V6", "High")
  ).toDF("ID", "Status")

  // Inner join on ID keeps only the IDs present in both dataframes.
  val joined = df1.as("left")
    .join(df2.as("right"), Seq("ID"), "inner")
    .select(
      $"ID",
      $"left.Status".as("Status_left"),
      $"right.Status".as("Status_right")
    )

  // Split the two status columns back out and stack them.
  val leftStatus  = joined.select($"ID", $"Status_left".as("Status"))
  val rightStatus = joined.select($"ID", $"Status_right".as("Status"))
  val stacked     = leftStatus.union(rightStatus)

  // optionally sort if you want the exact same output as you had
  stacked.sort($"ID").show()
}
```
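The join-then-stack shape above can be mimicked on plain Scala collections, which makes the row bookkeeping easy to check without spinning up a Spark session (the object name and intermediate names are illustrative only):

```scala
object InnerJoinUnionSketch {
  val df1 = List(("V1", "Low"), ("V2", "Low"), ("V3", "Low"))
  val df2 = List(("V1", "High"), ("V2", "High"), ("V6", "High"))

  // Inner join on ID: one (ID, leftStatus, rightStatus) row per matching pair.
  val joined = for {
    (idL, sL) <- df1
    (idR, sR) <- df2
    if idL == idR
  } yield (idL, sL, sR)

  // Split the joined rows back into two (ID, status) lists and stack them.
  val stacked = joined.map { case (id, sL, _) => (id, sL) } ++
                joined.map { case (id, _, sR) => (id, sR) }

  // Stable sort by ID reproduces the ordering in the expected output.
  val sorted = stacked.sortBy(_._1)
}
```

Note that in Spark, `union` matches columns by position, not by name; here both halves are already `(ID, Status)` in the same order, which is exactly why the `select`s in the answer rename both status columns to `Status` before the union.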