0

I have two DataFrames recommendations and movies. Columns rec1-rec3 in recommendations represent movie id from movies dataframe.

val recommendations: DataFrame = List(
        (0, 1, 2, 3),
        (1, 2, 3, 4),
        (2, 1, 3, 4)).toDF("id", "rec1", "rec2", "rec3")

val movies = List(
        (1, "the Lord of the Rings"),
        (2, "Star Wars"),
        (3, "Star Trek"),
        (4, "Pulp Fiction")).toDF("id", "name")

What I want:

+---+------------------------+------------+------------+
| id|                    rec1|        rec2|        rec3|
+---+------------------------+------------+------------+
|  0|   the Lord of the Rings|   Star Wars|   Star Trek|
|  1|               Star Wars|   Star Trek|Pulp Fiction|
|  2|   the Lord of the Rings|   Star Trek|   Star Trek|
+---+------------------------+------------+------------+

2 Answers 2

5

We can also use the functions stack() and pivot() to arrive at your expected output, joining the two dataframes only once.

// First rename 'id' column to 'ids' avoid duplicate names further downstream
val moviesRenamed = movies.withColumnRenamed("id", "ids")

recommendations.select($"id", expr("stack(3, 'rec1', rec1, 'rec2', rec2, 'rec3', rec3) as (rec, movie_id)"))
  .where("rec is not null")
  .join(moviesRenamed, col("movie_id") === moviesRenamed.col("ids"))
  .groupBy("id")
  .pivot("rec")
  .agg(first("name"))
  .show()
+---+--------------------+---------+------------+
| id|                rec1|     rec2|        rec3|
+---+--------------------+---------+------------+
|  0|the Lord of the R...|Star Wars|   Star Trek|
|  1|           Star Wars|Star Trek|Pulp Fiction|
|  2|the Lord of the R...|Star Trek|Pulp Fiction|
+---+--------------------+---------+------------+
Sign up to request clarification or add additional context in comments.

Comments

1

I figured it out. You should create aliases for your columns just like in SQL.

  val joined = recommendation
    .join(movies.select(col("id").as("id1"), 'name.as("n1")), 'id1 === recommendation.col("rec1"))
    .join(movies.select(col("id").as("id2"), 'name.as("n2")), 'id2 === recommendation.col("rec2"))
    .join(movies.select(col("id").as("id3"), 'name.as("n3")), 'id3  === recommendation.col("rec3"))
    .select('id, 'n1, 'n2, 'n3)
  joined.show()

Query will result in

+---+--------------------+---------+------------+
| id|                  n1|       n2|          n3|
+---+--------------------+---------+------------+
|  0|the Lord of the R...|Star Wars|   Star Trek|
|  1|           Star Wars|Star Trek|Pulp Fiction|
|  2|the Lord of the R...|Star Trek|Pulp Fiction|
+---+--------------------+---------+------------+

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.