I am a newbie to Spark and am trying to read and research as much as I can. Currently I am stuck on this problem and have spent a few days trying to solve it. I have successfully set up a Spark cluster on 3 machines (1 master, 2 slaves) and run some examples. Now I am trying to write a Python application that reads a CSV file, converts each row into its own JSON document, and uploads all of them to S3. Here are my problems:
I have converted the CSV to a Spark DataFrame using SparkSession.read.csv(). How do I split this DataFrame into individual rows and convert each one to JSON? I have read that Spark DataFrame has a toJSON function, but it appears to apply to the whole DataFrame, so how can I use this function on each row of the DataFrame instead of the whole thing?

How can I make my application distributed, given that I have 2 slaves and one master? Or does Spark automatically split the work into smaller parts and assign them to the slaves?
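To make the goal concrete, here is the per-row conversion I am trying to achieve, written in plain Python with the standard csv and json modules (the sample data below is made up, standing in for my real file):

```python
import csv
import io
import json

# Sample CSV content standing in for the real input file.
csv_text = "id,name,price\n1,apple,0.5\n2,banana,0.25\n"

# csv.DictReader yields one dict per data row, keyed by the header line.
rows = csv.DictReader(io.StringIO(csv_text))

# Convert each row to its own JSON string -- this is the per-row
# output I want Spark to produce for the whole file.
json_rows = [json.dumps(row) for row in rows]

for j in json_rows:
    print(j)
# → {"id": "1", "name": "apple", "price": "0.5"}
# → {"id": "2", "name": "banana", "price": "0.25"}
```

This is what I want each row of the DataFrame to become, but done in a distributed way across the cluster.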
How can I upload the converted JSON to S3? Some sample code for guidance would help me the most.
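For reference, this is a sketch of what I imagine the job might look like, using DataFrame.write.json with an s3a:// path. The bucket name, file path, and credentials are placeholders, and I am assuming the hadoop-aws package is on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-json-s3").getOrCreate()

# Credentials for the s3a:// connector; MY_ACCESS_KEY / MY_SECRET_KEY are
# placeholders -- normally these would come from the environment or IAM roles.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "MY_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "MY_SECRET_KEY")

# Read the CSV into a DataFrame, one Row per CSV line.
df = spark.read.csv("input.csv", header=True, inferSchema=True)

# Write out newline-delimited JSON, one JSON object per row,
# directly to the bucket (path is a placeholder).
df.write.mode("overwrite").json("s3a://my-bucket/output/")

spark.stop()
```

I am not sure whether this is the right approach, or whether I should collect the rows and upload them individually instead.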
I would really appreciate any help you could give; thanks in advance.