I have been working as a Data Engineer and ran into the following issue.
I have a use case where a view (let's call it inputView) is created by reading data from some source.
Later in the pipeline I have to read data from an RDBMS and create another view (transactions).
I then run a Spark SQL query that joins the transactions view with inputView on some column.

Problem:
Only the data that is actually required from transactions should be loaded, but the whole table is being read.

Proposed solution (not sure it is safe):
To solve this we are planning to create a temp table in the database, write the inputView data into the RDBMS, do the join at the database level, and read the result back.
The issues I see with this approach are:
1. Temp-table data is available per session only. Spark executors each open their own session while inserting the inputView data into the database, and the Spark write API opens a session, writes the data, and closes it again (see the sketch after this list). So the temp data will already be gone before the join query even runs.
2. If instead I write each record one by one from the driver using JDBC prepared statements, then to run the join and read the result I have to use that same connection; I can't use the Spark read API for it. Reading the result through plain JDBC pulls all the data into the driver, which can cause an OOM.
3. Suppose multiple pipelines are running and each tries to insert its inputView data into some temp table. The database will come under a lot of load; won't it crash?
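For context on point 1, this is roughly what writing inputView to the database over JDBC looks like. Spark distributes the insert across executors with one connection per partition, so a session-scoped temp table created from the driver would not be visible to those writers. This is only a sketch; the URL, credentials, and table name are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stage-inputview").getOrCreate()

    # inputView was registered earlier in the pipeline as a temp view.
    input_df = spark.table("inputView")

    # Each partition is written over its own JDBC connection, so the target
    # has to be a table that outlives individual sessions; a session-scoped
    # temp table created elsewhere would not be visible here.
    (input_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")      # placeholder URL
        .option("dbtable", "staging_input_view")                  # placeholder table name
        .option("user", "etl_user")                               # placeholder credentials
        .option("password", "secret")
        .mode("append")
        .save())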

Any suggestion/solution is welcome.

Thanks in advance.

3 Replies

You need to provide (1) relevant table definitions (2) relevant view definitions (3) sample table data and (4) sample result data. Without all of that your question is just a guessing game. It also probably belongs on dba.stackexchange.com rather than here.

My goal is to load transactions only for those customers that are present in inputView, and then do some processing on them later.

The current implementation is:

1. HDFS has data for customers with columns (cust_id, date, uuid).
First I read this data and create the input view on top of it.

2. Later in the pipeline I create a view from the transactions table in the database, which has the schema
(transaction_id, customer_id, date, transaction_amt, tran_type, current_amt).

3. At this point I have both views, input and transactions, and I run Spark SQL on them: "SELECT i.cust_id, t.transaction_id, t.transaction_amt, t.tran_type FROM transactions t JOIN input i ON i.cust_id = t.customer_id AND i.date = t.date".
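A minimal PySpark sketch of these three steps as I understand them (the connection details, credentials, and HDFS path are placeholders, and I'm assuming the HDFS data is Parquet, which may not match the real format):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transactions-join").getOrCreate()

    # Step 1: read the customer data from HDFS and register the input view.
    input_df = spark.read.parquet("hdfs:///data/customers")       # placeholder path/format
    input_df.createOrReplaceTempView("input")

    # Step 2: read the transactions table from the RDBMS over JDBC.
    transactions_df = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")      # placeholder URL
        .option("dbtable", "transactions")
        .option("user", "etl_user")                                # placeholder credentials
        .option("password", "secret")
        .load())
    transactions_df.createOrReplaceTempView("transactions")

    # Step 3: join the two views in Spark SQL.
    result = spark.sql("""
        SELECT i.cust_id, t.transaction_id, t.transaction_amt, t.tran_type
        FROM transactions t
        JOIN input i
          ON i.cust_id = t.customer_id AND i.date = t.date
    """)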

Now what happens here is that Spark will load all the data from the transactions table to create the view, which is not efficient.
I want to achieve some filter pushdown into the RDBMS as well, like Spark does for HDFS.
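One way to get that kind of pushdown with the plain JDBC source, continuing the sketch above, is to build the filter yourself: collect the distinct keys from the input view on the driver and pass a subquery through the dbtable option, so the database only returns matching rows. This assumes the cust_id set is small enough to inline into a query (and the quoting below assumes string ids); it is a sketch, not the pipeline's actual code:

    # Collect the distinct customer ids from the input view (assumed small).
    cust_ids = [row.cust_id for row in
                spark.table("input").select("cust_id").distinct().collect()]
    id_list = ", ".join("'{}'".format(c) for c in cust_ids)

    # Read a filtered subquery instead of the whole table, so the database
    # does the filtering before anything reaches Spark.
    pushed = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")       # placeholder URL
        .option("dbtable",
                "(SELECT * FROM transactions "
                "WHERE customer_id IN ({})) AS t".format(id_list))
        .option("user", "etl_user")                                 # placeholder credentials
        .option("password", "secret")
        .load())
    pushed.createOrReplaceTempView("transactions")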

'Spark will load all the data from the transactions table to create the view': no it won't. Creating the view just registers the SQL that defines it. No data is loaded at all. When you SELECT from that view, the data required to fill it is loaded, but that won't be the entirety of any table provided you have suitable indexes matching your WHERE and JOIN ... USING or ON clauses. A Data Engineer should already know all this, and you should certainly not just engage in guesswork as to how it all works. In other words, make sure you've got a problem before you try to solve it.
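For what it's worth, the laziness point holds on the Spark side too: defining the JDBC DataFrame, registering a temp view, or adding a filter reads nothing, and rows are only fetched when an action runs. With the plain JDBC source, simple column predicates are pushed down to the database as a WHERE clause, but a join against another Spark view is not, which is where the worry about a full-table read comes from. A small sketch with placeholder names:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

    # Nothing is read from the database yet: this only defines the source.
    tx = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")        # placeholder URL
        .option("dbtable", "transactions")
        .option("user", "etl_user")                                  # placeholder credentials
        .option("password", "secret")
        .load())

    # Still nothing is read: temp views and transformations are lazy too.
    tx.createOrReplaceTempView("transactions")
    recent = tx.filter(F.col("date") >= "2024-01-01")

    # Only an action triggers the read. A simple predicate like the one above
    # is sent to the database as a WHERE clause (check with recent.explain()),
    # so the whole table is not transferred for this kind of query.
    recent.count()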
