
I need to save a DataFrame in CSV or Parquet format (as a single file) and then open it again. The amount of data will not exceed 60 MB, so a single file is a reasonable solution. This simple task is giving me a lot of headaches... This is what I tried:

To read the file if it exists:

val df = sqlContext.read
  .parquet("s3n://bucket/myTest.parquet")
  .toDF("key", "value", "date", "qty")

To write the file:

df.write.parquet("s3n://bucket/myTest.parquet")

This does not work because:

1) write creates a folder myTest.parquet full of Hadoop-style part files that I then cannot read back with .read.parquet("s3n://bucket/myTest.parquet"). In fact I don't mind the multiple part files, as long as I can later read them easily into a DataFrame. Is that possible?

2) I am always working with the same file myTest.parquet, which I keep updating and overwriting in S3. It tells me that the file cannot be saved because it already exists.

So, can someone point me to the right way to do this read/write loop? The file format doesn't matter to me (CSV, Parquet, Hadoop part files), as long as I can make the read/write loop work.
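For reference, both issues described above have standard workarounds: pointing read.parquet at the output folder loads all part files back as one DataFrame, coalesce(1) shrinks the output to a single part file, and mode("overwrite") replaces an existing path instead of failing. A minimal Scala sketch (bucket path and column names taken from the question; this assumes Spark 2.x's SparkSession, while on 1.6 the same write/read calls work on sqlContext):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ReadWriteLoop").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the question's DataFrame.
val df = Seq(("k1", "v1", "2016-11-20", 1)).toDF("key", "value", "date", "qty")

// Write: coalesce(1) produces a single part file inside the
// myTest.parquet folder; mode("overwrite") deletes any previous output.
df.coalesce(1)
  .write
  .mode("overwrite")
  .parquet("s3n://bucket/myTest.parquet")

// Read: passing the folder path loads every part file
// (just one here) back into a single DataFrame.
val restored = spark.read
  .parquet("s3n://bucket/myTest.parquet")
  .toDF("key", "value", "date", "qty")
```

Note that even with coalesce(1) the result is still a folder containing one part file plus a _SUCCESS marker, not a bare file; Spark's reader handles that layout transparently.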

  • Have you seen the spark-csv package? Commented Nov 20, 2016 at 13:57
  • not sure if you know this, but you specify the parquet folder, not the actual file, when reading. if it already exists, use the .mode('overwrite') option Commented Feb 4 at 16:51

1 Answer


You can save your DataFrame with saveAsTable("TableName") and read it back with table("TableName"). The storage location is controlled by spark.sql.warehouse.dir, and you can overwrite an existing table with mode(SaveMode.Overwrite). You can read more in the official documentation.

In Java it would look like this:

import org.apache.spark.sql.*;

SparkSession spark = ...  // existing session
spark.conf().set("spark.sql.warehouse.dir", "hdfs://localhost:9000/tables");
Dataset<Row> data = ...   // the DataFrame to persist
data.write().mode(SaveMode.Overwrite).saveAsTable("TableName");

You can then read the data back with:

spark.read().table("TableName");
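The same saveAsTable round trip looks like this in Scala, for comparison (the warehouse path and table name are the illustrative values from the Java snippet above, and a SparkSession is assumed to exist):

```scala
import org.apache.spark.sql.SaveMode

// Managed tables are stored under this directory.
spark.conf.set("spark.sql.warehouse.dir", "hdfs://localhost:9000/tables")

// Persist the DataFrame as a managed table, replacing it if it exists.
data.write.mode(SaveMode.Overwrite).saveAsTable("TableName")

// Read it back by name.
val restored = spark.table("TableName")
```

The trade-off versus writing a plain Parquet path is that the table is tracked by the metastore, so you read it by name rather than by S3/HDFS path.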

3 Comments

May I ask you to give an example?
Edited my answer. Sorry, my example is in Java, because I am not very good at Scala. Do you use Spark 2.x.x? As far as I know there is no SparkSession before that, but you should be able to do it with SparkContext.
I use Spark 1.6.2, because Spark 2.0.0 is currently not advisable for production.
