
I'd like to add a column to a table and then fill it with values from another table. Below is a highly simplified version of my problem.

CREATE TABLE table_1 (
   id INT,
   a DECIMAL(19,2)
)

INSERT INTO TABLE table_1 VALUES (1, 3.0)
INSERT INTO TABLE table_1 VALUES (2, 4.0)

CREATE TABLE table_2 (
   id INT,
   b DECIMAL(19,2),
   c DECIMAL(19,2)
)

INSERT INTO TABLE table_2 VALUES (1, 1.0, 4.0)
INSERT INTO TABLE table_2 VALUES (2, 2.0, 1.0)

-- The next two parts illustrate what I'd like to accomplish
ALTER TABLE table_1 ADD COLUMNS (d Decimal(19,2))

UPDATE table_1
SET d = (table_1.a - table_2.b) / table_2.c
FROM table_2
WHERE table_1.id = table_2.id

In the end SELECT * FROM table_1 would produce something like this:

+---+----+----+
| id|   a|   d|
+---+----+----+
|  1|3.00|0.50|
|  2|4.00|2.00|
+---+----+----+

However, when I run the update commands, Spark (version 2.4) immediately complains about the update statement.

UPDATE table_1 ...
^^^

Ultimately I need a table with the same name as the original table and with the new column. Using only Spark SQL, what can I do to accomplish my objective? It seems like I can't perform an UPDATE, but are there SQL hacks that accomplish the same end result? In my real problem I need to add about 100 columns to a large table, so the solution also shouldn't drag down performance or create lots of copies of the data that eat up disk space.

Another way of phrasing my question: can I accomplish the Databricks equivalent of an UPDATE (see here) using the open source version of Spark?

2 Answers


One way is to create two temporary tables, populate them, and then join them to create your final table. General steps and (untested) code are below.

1) Create temporary tables

CREATE TEMPORARY TABLE temp_table_1 (
   id INT,
   a DECIMAL(19,2)
)

INSERT INTO TABLE temp_table_1 VALUES (1, 3.0)
INSERT INTO TABLE temp_table_1 VALUES (2, 4.0)

CREATE TEMPORARY TABLE temp_table_2 (
   id INT,
   b DECIMAL(19,2),
   c DECIMAL(19,2)
)

INSERT INTO TABLE temp_table_2 VALUES (1, 1.0, 4.0)
INSERT INTO TABLE temp_table_2 VALUES (2, 2.0, 1.0)
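
Depending on your Spark build, CREATE TEMPORARY TABLE with an explicit schema may not be accepted, and temporary views generally can't be INSERTed into. A minimal alternative sketch, assuming that's the case, is to define the temporary relations from inline VALUES (column types are inferred from the literals rather than declared):

CREATE TEMPORARY VIEW temp_table_1 AS
SELECT * FROM VALUES (1, 3.0), (2, 4.0) AS t(id, a)

CREATE TEMPORARY VIEW temp_table_2 AS
SELECT * FROM VALUES (1, 1.0, 4.0), (2, 2.0, 1.0) AS t(id, b, c)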

2) Create your final table

CREATE TABLE table_1
AS
SELECT t1.id, t1.a, t2.b, (t1.a - t2.b) / t2.c as d
FROM temp_table_1 AS t1
JOIN temp_table_2 AS t2 ON t1.id = t2.id
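
One caveat: CREATE TABLE ... AS will fail if a table named table_1 already exists, which it does in the question's setup. Assuming the data has been copied into the temporary tables in step 1, it should be safe to drop the original before running step 2:

DROP TABLE IF EXISTS table_1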

Remember that Spark isn't a database; dataframes are table-like references that can be queried, but are not the same as tables. What you want to do is create a view that combines your tables into a table-like structure, and then persist or use that view.

CREATE TEMPORARY VIEW table_3 AS
SELECT t1.id, t1.a, t2.b, t2.c, (t1.a - t2.b) / t2.c as d
FROM table_1 t1 INNER JOIN table_2 t2
ON t1.id = t2.id

You'll eventually want to write that view back to a table, but you don't need to do this after adding each of your 100 columns.
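
When you do write it back, a plain CTAS from the view should work. The name table_1_new below is hypothetical, since Spark won't let you create a table over the existing table_1 without dropping or renaming it first:

CREATE TABLE table_1_new AS
SELECT * FROM table_3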

1 Comment

If you were using regular Spark, you'd be able to re-assign and reuse the name that you use for the dataframe / temporary table. Using SparkSQL directly (via JDBC?) I don't know how you'd do that.
