update row value using another table in spark java

Question

I have 2 datasets in java spark, first one contains id, name , age and second one are the same, i need to check values (name and id) and if it's similar update the age with new age in dataset2

i tried all possible ways but i found that java in spark don't have much resourses and all possible ways not worked

this is one i tried :

        dataset1.createOrReplaceTempView("updatesTable");
        datase2.createOrReplaceTempView("carsTheftsFinal2");
        updatesNew.show();
          Dataset<Row> test = spark.sql( "ALTER carsTheftsFinal2 set carsTheftsFinal2.age = updatesTable.age from updatesTable where carsTheftsFinal2.id = updatesTable.id AND carsTheftsFinal2.name = updatesTable.name ");
         test.show(12);

and this is the error :

Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input 'ALTER carsTheftsFinal2'(line 1, pos 6)

I have hint: that i can use join to update without using update statement ( java spark not provide update )

user9276715 · Accepted Answer · 2022-07-20 21:47:56Z

1

Assume that we have ds1 with this data:

+---+-----------+---+
| id|       name|age|
+---+-----------+---+
|  1|    Someone| 18|
|  2|       Else| 17|
|  3|SomeoneElse| 14|
+---+-----------+---+

and ds2 with this data:

+---+----------+---+
| id|      name|age|
+---+----------+---+
|  1|   Someone| 14|
|  2|      Else| 18|
|  3|NotSomeone| 14|
+---+----------+---+

According to your expected result, the final table would be:

+---+-----------+---+
| id|       name|age|
+---+-----------+---+
|  3|SomeoneElse| 14| <-- not modified, same as ds
|  1|    Someone| 14| <-- modified, was 18
|  2|       Else| 18| <-- modified, was 17
+---+-----------+---+

This is achieved with the following transformations, first, we rename ds2's age with age2.

val renamedDs2 = ds2.withColumnRenamed("age", "age2")

Then:

// we join main dataset with the renamed ds2, now we have age and age2
ds1.join(renamedDs2, Seq("id", "name"), "left")
  // we overwrite age, if age2 is not null from ds2, take it, otherwise leave age
  .withColumn("age", 
    when(col("age2").isNotNull, col("age2")).otherwise(col("age"))
  )
  // finally, we drop age2
  .drop("age2")

Hope this does what you want!

answered Jul 20, 2022 at 21:47

user9276715

Sign up to request clarification or add additional context in comments.

4 Comments

Sara BB Over a year ago

Thanks alot, yes this is what i need, put with sorry i had a problem with Seq and isNotNull API's Cannot resolve method 'Seq' and Cannot resolve symbol 'isNotNull' how can i solve that please ?

user9276715 Over a year ago

In Java, you can do .isNotNull(), and for the joins, you can do ds1.col("id").equalTo(ds2.col("id")) && ds1.col("name").equalTo(ds2.col("name")) Please, mark the question as correct if you think this is what you want so other people can reach out easier

Sara BB Over a year ago

it worked thanks a lot, but there's small problem, id and name are repeated with id and name headers so if i dropped it main cols will dropped too, only age2 dropped , how can i drop it ?

user9276715 Over a year ago

If you are doing this in Java, then you will have to rename right table's names (as I did with age), then you can drop id2, name2 and age2.

Collectives™ on Stack Overflow

update row value using another table in spark java

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related