I am a newbie to Spark and am trying to read and research as much as I can. Currently I am stuck on this and have spent a few days trying to solve it. I have successfully set up a Spark cluster on 3 machines (1 master, 2 slaves) and run some examples. Now I am trying to write a Python application that reads a CSV file, splits each row into a JSON file, and uploads all of them to S3. Here are my problems:

  1. I have converted the CSV to a Spark DataFrame using SparkSession.read.csv(). How do I split this DataFrame into multiple rows and convert them to JSON? I have read that Spark DataFrame has a toJSON function, but it seemed to apply to the whole DataFrame, so how can I use this function on each row of the DataFrame instead of on the whole thing?

  2. How can I make my application distributed, given that I have 2 slaves and one master? Or does my application automatically split the work into smaller parts and assign them to the slaves?

  3. How can I put the converted JSON into S3? Some sample code guidance would help me best.

I would really appreciate it if you could help me. Thanks in advance.

1 Answer

  1. To read JSON files, you can use spark.read.json() (the older sqlContext.jsonFile() is deprecated). Then you can use regular SQL queries for processing; the first sketch after this list shows the pattern, including a row-wise toJSON() call.
  2. Spark works on partitions. Your data will be divided into partitions and run on executors; how the work is distributed is handled by Spark based on the mode you are using. Not sure if you are using YARN. The second sketch below shows how to inspect and adjust the partitioning.
  3. In Python, you can use boto3 to save the data to Amazon S3. It's a very easy-to-use package; the third sketch below shows a minimal upload loop.
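
For point 1, a minimal sketch, assuming Spark 2.x, a local CSV with a header row, and placeholder file/app names. It reads the data, runs a plain SQL query against it, and then converts each row with toJSON(), which is row-wise: it returns an RDD containing one JSON document per row.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-json").getOrCreate()

    # Read the CSV into a DataFrame (header/inferSchema are assumptions
    # about the input file)
    df = spark.read.csv("people.csv", header=True, inferSchema=True)

    # Register the DataFrame so regular SQL queries work against it
    df.createOrReplaceTempView("people")
    result = spark.sql("SELECT * FROM people")

    # toJSON() converts the DataFrame row by row: the result is an RDD
    # of strings, one JSON document per row
    json_rdd = result.toJSON()
    print(json_rdd.first())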
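
For point 2, a short sketch of how to see and influence the partitioning; the partition count of 8 is purely illustrative. Each partition becomes a task that an executor on one of the slaves runs, so no extra code is needed to distribute the work.

    # Spark splits the DataFrame into partitions; each partition is
    # processed as a task on one of the executors
    print(df.rdd.getNumPartitions())

    # Repartition explicitly if the default split is too coarse;
    # 8 is just an example value
    df = df.repartition(8)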
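
For point 3, a minimal boto3 sketch; the bucket name and key scheme are placeholders, and credentials are assumed to come from the environment or ~/.aws/credentials. Note that collect() pulls everything to the driver, which is fine for small data; for large data you would upload from inside foreachPartition instead.

    import boto3

    s3 = boto3.client("s3")

    # Upload each row's JSON document as its own S3 object
    for i, doc in enumerate(df.toJSON().collect()):
        s3.put_object(
            Bucket="my-bucket",                 # placeholder bucket
            Key="rows/row-{}.json".format(i),   # placeholder key scheme
            Body=doc.encode("utf-8"),
        )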

1 Comment

All of your points are correct and helped me a lot in finding the answer. Thank you.
