Suppose you access a SQL database with Spark SQL. With an RDD, Spark partitions the data into many parts that together make up the data set.
My question is: how does Spark SQL manage this access from the N nodes to the database? I can see several possibilities:
Each node holding a partition of the RDD accesses the database and builds up its own part. The advantage is that no single node is forced to allocate a lot of memory, but the database has to sustain N connections, with N potentially very large.
A single node accesses the database and ships the data to the other N-1 nodes as required. The problem is that this single node would need to hold all the data, which is unworkable in many cases. Possibly this could be alleviated by fetching the data in chunks.
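To illustrate what I imagine for the first possibility: each partition could be assigned a slice of the table via a WHERE clause on a numeric column, so every node pulls only its own rows. Here is a minimal plain-Python sketch of how such slice predicates might be computed (the function name and the unbounded first/last slices are my own assumptions, not Spark's actual code):

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split the range [lower, upper) of a numeric column into one
    WHERE-clause predicate per partition. The first slice is left open
    below and the last open above, so no rows outside the assumed
    bounds are silently dropped."""
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == 0:
            preds.append(f"{column} < {lo + stride}")
        elif i == num_partitions - 1:
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return preds

# Each predicate would back one partition, i.e. one database connection.
print(partition_predicates("id", 0, 100, 4))
# → ['id < 25', 'id >= 25 AND id < 50', 'id >= 50 AND id < 75', 'id >= 75']
```

With this scheme the number of concurrent connections equals the number of partitions, which is exactly the memory-vs-connections trade-off described above.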
The JDBC package uses pooled connections to avoid connecting again and again, but that does not address this problem.
Is there a reference explaining how Spark manages this access to a SQL database? How much of it can be parameterized?
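For context on the parameterization part of the question, these are the knobs I am aware of for a partitioned JDBC read (the URL, table, and values below are placeholders, and I may be missing options):

```python
# Options one would pass to spark.read.format("jdbc"); collected here
# as a plain dict so the sketch runs without a Spark installation.
jdbc_options = {
    "url": "jdbc:postgresql://host:5432/mydb",  # placeholder connection URL
    "dbtable": "my_table",                      # placeholder table name
    "partitionColumn": "id",    # numeric column used to split the read
    "lowerBound": "0",          # assumed minimum of partitionColumn
    "upperBound": "1000000",    # assumed maximum of partitionColumn
    "numPartitions": "8",       # upper bound on concurrent connections
    "fetchsize": "1000",        # JDBC fetch size per round trip
}

for key, value in sorted(jdbc_options.items()):
    print(f"{key} = {value}")
```

My reading is that `numPartitions` directly bounds how many simultaneous connections the database must sustain, which is the trade-off the question is about, but I would welcome a reference confirming this.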