I have been working as a Data Engineer and ran into the following issue.
I have a use case where a view (let's call it inputView) is created by reading data from some source.
Later in the pipeline I have to read data from an RDBMS and create another view (transactions).
I then run a Spark SQL query that joins the transactions view with inputView on some column.

Problem:
Only the data that is actually required from transactions should be loaded, but the whole table is being read.

Proposed solution (not sure it is safe):
To solve this we are planning to create a temp table in the database, write the inputView data into the RDBMS, do the join at the database level, and read the result back.
The issues I see with this approach are:
1. Temp-table data is available per session only. Spark executors each open their own session while inserting the inputView data into the database, and the Spark write API opens a session, writes the data, and closes it again (see the sketch after this list). So the temp data will already be gone before the join query even runs.
2. If instead I write each record one by one from the driver using JDBC prepared statements, then to run the join and read the result I have to use that same connection; I can't use the Spark read API for it. Reading the result through plain JDBC pulls all the data into the driver, which can cause an OOM.
3. Suppose multiple pipelines are running and each tries to insert its inputView data into some temp table. The database will come under a lot of load; won't it crash?
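For context on point 1, this is roughly what writing inputView to the database over JDBC looks like. Spark distributes the insert across executors with one connection per partition, so a session-scoped temp table created from the driver would not be visible to those writers. This is only a sketch; the URL, credentials, and table name are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stage-inputview").getOrCreate()

    # inputView was registered earlier in the pipeline as a temp view.
    input_df = spark.table("inputView")

    # Each partition is written over its own JDBC connection, so the target
    # has to be a table that outlives individual sessions; a session-scoped
    # temp table created elsewhere would not be visible here.
    (input_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")      # placeholder URL
        .option("dbtable", "staging_input_view")                  # placeholder table name
        .option("user", "etl_user")                               # placeholder credentials
        .option("password", "secret")
        .mode("append")
        .save())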

Any suggestion/solution is welcome.

Thanks in advance.

3 Replies

You need to provide (1) relevant table definitions (2) relevant view definitions (3) sample table data and (4) sample result data. Without all of that your question is just a guessing game. It also probably belongs on dba.stackexchange.com rather than here.

My goal is to load transactions only for those customers that are present in inputView, and then do some processing on them later.

The current implementation is:

1. HDFS has data for customers with columns (cust_id, date, uuid).
First I read this data and create the input view on top of it.

2. Later in the pipeline I create a view from the transactions table in the database, which has the schema
(transaction_id, customer_id, date, transaction_amt, tran_type, current_amt).

3. At this point I have both views, input and transactions, and I run Spark SQL on them: "SELECT i.cust_id, t.transaction_id, t.transaction_amt, t.tran_type FROM transactions t JOIN input i ON i.cust_id = t.customer_id AND i.date = t.date".
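A minimal PySpark sketch of these three steps as I understand them (the connection details, credentials, and HDFS path are placeholders, and I'm assuming the HDFS data is Parquet, which may not match the real format):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transactions-join").getOrCreate()

    # Step 1: read the customer data from HDFS and register the input view.
    input_df = spark.read.parquet("hdfs:///data/customers")       # placeholder path/format
    input_df.createOrReplaceTempView("input")

    # Step 2: read the transactions table from the RDBMS over JDBC.
    transactions_df = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")      # placeholder URL
        .option("dbtable", "transactions")
        .option("user", "etl_user")                                # placeholder credentials
        .option("password", "secret")
        .load())
    transactions_df.createOrReplaceTempView("transactions")

    # Step 3: join the two views in Spark SQL.
    result = spark.sql("""
        SELECT i.cust_id, t.transaction_id, t.transaction_amt, t.tran_type
        FROM transactions t
        JOIN input i
          ON i.cust_id = t.customer_id AND i.date = t.date
    """)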

Now what happens here is that Spark will load all the data from the transactions table to create the view, which is not efficient.
I want to achieve some filter pushdown into the RDBMS as well, like Spark does for HDFS.
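One way to get that kind of pushdown with the plain JDBC source, continuing the sketch above, is to build the filter yourself: collect the distinct keys from the input view on the driver and pass a subquery through the dbtable option, so the database only returns matching rows. This assumes the cust_id set is small enough to inline into a query (and the quoting below assumes string ids); it is a sketch, not the pipeline's actual code:

    # Collect the distinct customer ids from the input view (assumed small).
    cust_ids = [row.cust_id for row in
                spark.table("input").select("cust_id").distinct().collect()]
    id_list = ", ".join("'{}'".format(c) for c in cust_ids)

    # Read a filtered subquery instead of the whole table, so the database
    # does the filtering before anything reaches Spark.
    pushed = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")       # placeholder URL
        .option("dbtable",
                "(SELECT * FROM transactions "
                "WHERE customer_id IN ({})) AS t".format(id_list))
        .option("user", "etl_user")                                 # placeholder credentials
        .option("password", "secret")
        .load())
    pushed.createOrReplaceTempView("transactions")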

'Spark will load all the data from the transactions table to create the view': no it won't. Creating the view just registers the SQL that defines it. No data is loaded at all. When you SELECT from that view, the data required to fill it is loaded, but that won't be the entirety of any table provided you have suitable indexes matching your WHERE and JOIN ... USING or ON clauses. A Data Engineer should already know all this, and you should certainly not just engage in guesswork as to how it all works. In other words, make sure you've got a problem before you try to solve it.
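For what it's worth, the laziness point holds on the Spark side too: defining the JDBC DataFrame, registering a temp view, or adding a filter reads nothing, and rows are only fetched when an action runs. With the plain JDBC source, simple column predicates are pushed down to the database as a WHERE clause, but a join against another Spark view is not, which is where the worry about a full-table read comes from. A small sketch with placeholder names:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

    # Nothing is read from the database yet: this only defines the source.
    tx = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")        # placeholder URL
        .option("dbtable", "transactions")
        .option("user", "etl_user")                                  # placeholder credentials
        .option("password", "secret")
        .load())

    # Still nothing is read: temp views and transformations are lazy too.
    tx.createOrReplaceTempView("transactions")
    recent = tx.filter(F.col("date") >= "2024-01-01")

    # Only an action triggers the read. A simple predicate like the one above
    # is sent to the database as a WHERE clause (check with recent.explain()),
    # so the whole table is not transferred for this kind of query.
    recent.count()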
