I need to save a DataFrame in CSV or Parquet format (as a single file) and then open it again. The amount of data will not exceed 60 MB, so a single file is a reasonable solution. This simple task is giving me a lot of headaches. This is what I tried:
To read the file if it exists:
df = sqlContext \
    .read.parquet("s3n://bucket/myTest.parquet") \
    .toDF("key", "value", "date", "qty")
To write the file:
df.write.parquet("s3n://bucket/myTest.parquet")
This does not work because:
1) write creates a folder called myTest.parquet full of Hadoop-style part files that I then cannot read back with .read.parquet("s3n://bucket/myTest.parquet"). In fact I don't care about the multiple part files, as long as I can later read them easily into a DataFrame. Is that possible? (See the sketch after this list.)
2) I am always working with the same file, myTest.parquet, which I keep updating and overwriting in S3. Spark tells me that the file cannot be saved because it already exists.
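Regarding point 1, my understanding is that the folder of part files can be read back as a whole: pointing .read.parquet at the directory should pick up all the part-* files inside it. A minimal sketch, assuming PySpark and the same path as above:

# Read back the whole directory that df.write.parquet(...) produced;
# Spark loads every part-* file inside it into one DataFrame.
df2 = sqlContext.read.parquet("s3n://bucket/myTest.parquet")
df2.show(5)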
So, can someone point me to the right way to do this read/write loop? The file format doesn't matter to me (CSV, Parquet, or even the Hadoop-style part files), as long as I can make the read/write loop work.
For the second problem, the .mode('overwrite') option on the writer should make Spark replace the existing output instead of failing because the path already exists.
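A sketch of the write side with overwrite enabled (assuming PySpark; the coalesce(1) call is only my assumption, added because the data stays under ~60 MB and a single file was the stated goal):

# Replace the existing output instead of failing with "path already exists".
# coalesce(1) shrinks the output to a single part file (optional; assumes
# the data is small enough to fit comfortably in one ~60 MB file).
df.coalesce(1) \
  .write \
  .mode("overwrite") \
  .parquet("s3n://bucket/myTest.parquet")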