
I'd like to add a column to a table and then fill it with values from another table. Below is a highly simplified version of my problem.

CREATE TABLE table_1 (
   id INT,
   a DECIMAL(19,2)
)

INSERT INTO TABLE table_1 VALUES (1, 3.0)
INSERT INTO TABLE table_1 VALUES (2, 4.0)

CREATE TABLE table_2 (
   id INT,
   b DECIMAL(19,2),
   c DECIMAL(19,2)
)

INSERT INTO TABLE table_2 VALUES (1, 1.0, 4.0)
INSERT INTO TABLE table_2 VALUES (2, 2.0, 1.0)

-- The next two parts illustrate what I'd like to accomplish
ALTER TABLE table_1 ADD COLUMNS (d Decimal(19,2))

UPDATE table_1
SET d = (table_1.a - table_2.b) / table_2.c
FROM table_2
WHERE table_1.id = table_2.id

In the end SELECT * FROM table_1 would produce something like this:

+---+----+----+
| id|   a|   d|
+---+----+----+
|  1|3.00|0.50|
|  2|4.00|2.00|
+---+----+----+

However, when I run the update commands, Spark (version 2.4) immediately complains about the update statement.

UPDATE table_1 ...
^^^

Ultimately I need a table with the same name as the original table and with the new column. Using only Spark SQL, what can I do to accomplish my objective? It seems like I can't perform an UPDATE, but are there SQL hacks that accomplish the same end result? In my real problem I need to add about 100 columns to a large table, so the solution also shouldn't drag down performance or create lots of copies of the data that eat up disk space.

Another way of phrasing my question: can I accomplish the Databricks equivalent of an UPDATE (see here) using the open source version of Spark?

2 Answers


One way is to create two temporary tables, populate them, and then join them to create your final table. General steps and (untested) code are below.

1) Create temporary tables

CREATE TEMPORARY TABLE temp_table_1 (
   id INT,
   a DECIMAL(19,2)
)

INSERT INTO TABLE temp_table_1 VALUES (1, 3.0)
INSERT INTO TABLE temp_table_1 VALUES (2, 4.0)

CREATE TEMPORARY TABLE temp_table_2 (
   id INT,
   b DECIMAL(19,2),
   c DECIMAL(19,2)
)

INSERT INTO TABLE temp_table_2 VALUES (1, 1.0, 4.0)
INSERT INTO TABLE temp_table_2 VALUES (2, 2.0, 1.0)
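
Depending on your Spark build, CREATE TEMPORARY TABLE with an explicit schema may not be accepted, and temporary views generally can't be INSERTed into. A minimal alternative sketch, assuming that's the case, is to define the temporary relations from inline VALUES (column types are inferred from the literals rather than declared):

CREATE TEMPORARY VIEW temp_table_1 AS
SELECT * FROM VALUES (1, 3.0), (2, 4.0) AS t(id, a)

CREATE TEMPORARY VIEW temp_table_2 AS
SELECT * FROM VALUES (1, 1.0, 4.0), (2, 2.0, 1.0) AS t(id, b, c)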

2) Create your final table

CREATE TABLE table_1
AS
SELECT t1.id, t1.a, t2.b, (t1.a - t2.b) / t2.c as d
FROM temp_table_1 AS t1
JOIN temp_table_2 AS t2 ON t1.id = t2.id
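
One caveat: CREATE TABLE ... AS will fail if a table named table_1 already exists, which it does in the question's setup. Assuming the data has been copied into the temporary tables in step 1, it should be safe to drop the original before running step 2:

DROP TABLE IF EXISTS table_1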

Remember that Spark isn't a database; dataframes are table-like references that can be queried, but are not the same as tables. What you want to do is create a view that combines your tables into a table-like structure, and then persist or use that view.

CREATE TEMPORARY VIEW table_3 AS
SELECT t1.id, t1.a, t2.b, t2.c, (t1.a - t2.b) / t2.c as d
FROM table_1 t1 INNER JOIN table_2 t2
ON t1.id = t2.id

You'll eventually want to write that view back to a table, but you don't need to do this after adding each of your 100 columns.
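
When you do write it back, a plain CTAS from the view should work. The name table_1_new below is hypothetical, since Spark won't let you create a table over the existing table_1 without dropping or renaming it first:

CREATE TABLE table_1_new AS
SELECT * FROM table_3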

1 Comment

If you were using regular Spark, you'd be able to re-assign and reuse the name that you use for the dataframe / temporary table. Using SparkSQL directly (via JDBC?) I don't know how you'd do that.
