I work as a Data Engineer and ran into the following issue.
I have a use case where a view (let's call it inputView) is created by reading data from some source.
Later in the pipeline I have to read data from an RDBMS and create another view (transactions).
Then I run a Spark SQL query that joins the transactions view with inputView on some column.
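For context, here is roughly what the pipeline looks like today (paths, connection details, and the join column are placeholders, not the real ones):

```scala
// spark: SparkSession (e.g. from spark-shell or the application)

// inputView is built earlier from some upstream source (placeholder path).
val inputDf = spark.read.parquet("/path/to/source")
inputDf.createOrReplaceTempView("inputView")

// Later, the whole RDBMS table is pulled in over JDBC (placeholder connection details).
val txDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "transactions")
  .option("user", "user")
  .option("password", "password")
  .load()
txDf.createOrReplaceTempView("transactions")

// Join in Spark SQL on some column (placeholder key name).
val joined = spark.sql(
  """SELECT t.*
    |FROM transactions t
    |JOIN inputView i ON t.some_key = i.some_key""".stripMargin)
```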
Problem:
Only the rows of transactions that are actually needed for the join should be loaded, but the whole table is being read from the database.
Proposed Solution (not sure it is safe):
To solve this we are planning to create a temp table in the database, write inputView's data into it, run the join at the database level, and read back only the result.
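Roughly what we have in mind, sketched here with a regular staging table rather than a temp table, since the temp-table part is exactly what I'm unsure about (staging table name, key column, and connection details are made up):

```scala
// Placeholder connection details.
val jdbcUrl = "jdbc:postgresql://host:5432/db"
val props = new java.util.Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")

// 1. Push inputView's data into a staging table in the RDBMS.
spark.table("inputView")
  .write
  .mode("overwrite")
  .jdbc(jdbcUrl, "input_view_staging", props)

// 2. Read back only the join result, so the join runs inside the database.
val joined = spark.read
  .jdbc(jdbcUrl,
        "(SELECT t.* FROM transactions t " +
        "JOIN input_view_staging i ON t.some_key = i.some_key) q",
        props)
```

With a subquery passed as the table like this, only the joined rows come back to Spark.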
But this approach seems to have the following issues:
1. Temp-table data is visible per session only. The Spark executors each open their own session while inserting inputView's data into the database, and the Spark write API opens a connection, writes the data, and closes it again. So the temp data would be gone even before the join query runs; how would anything read the joined result afterwards?
2. If I instead write each record one by one from the driver using JDBC prepared statements, then I have to run the join and read the result on that same connection; I can't use the Spark read API for it. So I would also have to read the result through plain JDBC, which eventually loads all the data onto the driver and can cause an OOM (see the sketch after this list).
3. Suppose multiple pipelines are running and each tries to insert its inputView data into some temp table. The database will be under a lot of load; won't it crash?
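To make point 2 concrete, the single-connection variant would look roughly like this (plain JDBC from the driver; the temp table name, key column, and connection details are made up). The final result set has to be consumed on the driver, which is where the OOM risk comes from:

```scala
import java.sql.DriverManager

// Everything has to happen on one connection so the temp table stays visible.
val conn = DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "password")
try {
  conn.createStatement().execute(
    "CREATE TEMP TABLE input_view_tmp (some_key BIGINT)")

  // Insert inputView's rows one by one from the driver.
  val ps = conn.prepareStatement("INSERT INTO input_view_tmp (some_key) VALUES (?)")
  spark.table("inputView").select("some_key").collect().foreach { row =>
    ps.setLong(1, row.getLong(0))
    ps.addBatch()
  }
  ps.executeBatch()

  // Join on the same connection; the whole result lands in driver memory.
  val rs = conn.createStatement().executeQuery(
    "SELECT t.* FROM transactions t JOIN input_view_tmp i ON t.some_key = i.some_key")
  while (rs.next()) {
    // process rows on the driver -> potential OOM for large results
  }
} finally {
  conn.close()
}
```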
Any suggestion or solution is welcome.
Thanks in advance.