0

I have 2 datasets in java spark, first one contains id, name , age and second one are the same, i need to check values (name and id) and if it's similar update the age with new age in dataset2

i tried all possible ways but i found that java in spark don't have much resourses and all possible ways not worked

this is one i tried :

        dataset1.createOrReplaceTempView("updatesTable");
        datase2.createOrReplaceTempView("carsTheftsFinal2");
        updatesNew.show();
          Dataset<Row> test = spark.sql( "ALTER carsTheftsFinal2 set carsTheftsFinal2.age = updatesTable.age from updatesTable where carsTheftsFinal2.id = updatesTable.id AND carsTheftsFinal2.name = updatesTable.name ");
         test.show(12);

and this is the error :

Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input 'ALTER carsTheftsFinal2'(line 1, pos 6)

I have hint: that i can use join to update without using update statement ( java spark not provide update )

1 Answer 1

1

Assume that we have ds1 with this data:

+---+-----------+---+
| id|       name|age|
+---+-----------+---+
|  1|    Someone| 18|
|  2|       Else| 17|
|  3|SomeoneElse| 14|
+---+-----------+---+

and ds2 with this data:

+---+----------+---+
| id|      name|age|
+---+----------+---+
|  1|   Someone| 14|
|  2|      Else| 18|
|  3|NotSomeone| 14|
+---+----------+---+

According to your expected result, the final table would be:

+---+-----------+---+
| id|       name|age|
+---+-----------+---+
|  3|SomeoneElse| 14| <-- not modified, same as ds
|  1|    Someone| 14| <-- modified, was 18
|  2|       Else| 18| <-- modified, was 17
+---+-----------+---+

This is achieved with the following transformations, first, we rename ds2's age with age2.

val renamedDs2 = ds2.withColumnRenamed("age", "age2")

Then:

// we join main dataset with the renamed ds2, now we have age and age2
ds1.join(renamedDs2, Seq("id", "name"), "left")
  // we overwrite age, if age2 is not null from ds2, take it, otherwise leave age
  .withColumn("age", 
    when(col("age2").isNotNull, col("age2")).otherwise(col("age"))
  )
  // finally, we drop age2
  .drop("age2")

Hope this does what you want!

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks alot, yes this is what i need, put with sorry i had a problem with Seq and isNotNull API's Cannot resolve method 'Seq' and Cannot resolve symbol 'isNotNull' how can i solve that please ?
In Java, you can do .isNotNull(), and for the joins, you can do ds1.col("id").equalTo(ds2.col("id")) && ds1.col("name").equalTo(ds2.col("name")) Please, mark the question as correct if you think this is what you want so other people can reach out easier
it worked thanks a lot, but there's small problem, id and name are repeated with id and name headers so if i dropped it main cols will dropped too, only age2 dropped , how can i drop it ?
If you are doing this in Java, then you will have to rename right table's names (as I did with age), then you can drop id2, name2 and age2.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.