I need to save a DataFrame in CSV or Parquet format (as a single file) and then open it again. The amount of data will not exceed 60 MB, so a single file is a reasonable solution. This simple task is giving me a lot of headaches. This is what I tried:
To read the file if it exists:
df = sqlContext \
    .read.parquet("s3n://bucket/myTest.parquet") \
    .toDF("key", "value", "date", "qty")
To write the file:
df.write.parquet("s3n://bucket/myTest.parquet")
This does not work because:
1) write creates a folder called myTest.parquet full of Hadoop-style part files that I then cannot read back with .read.parquet("s3n://bucket/myTest.parquet"). In fact I don't care about the multiple part files, as long as I can later read them easily into a DataFrame. Is that possible? (See the sketch after this list.)
2) I am always working with the same file, myTest.parquet, which I keep updating and overwriting in S3. Spark tells me that the file cannot be saved because it already exists.
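Regarding point 1, my understanding is that the folder of part files can be read back as a whole: pointing .read.parquet at the directory should pick up all the part-* files inside it. A minimal sketch, assuming PySpark and the same path as above:

# Read back the whole directory that df.write.parquet(...) produced;
# Spark loads every part-* file inside it into one DataFrame.
df2 = sqlContext.read.parquet("s3n://bucket/myTest.parquet")
df2.show(5)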
So, can someone point me to the right way to do this read/write loop? The file format doesn't matter to me (CSV, Parquet, or even the Hadoop-style part files), as long as I can make the read/write loop work.
For the second problem, the .mode('overwrite') option on the writer should make Spark replace the existing output instead of failing because the path already exists.
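A sketch of the write side with overwrite enabled (assuming PySpark; the coalesce(1) call is only my assumption, added because the data stays under ~60 MB and a single file was the stated goal):

# Replace the existing output instead of failing with "path already exists".
# coalesce(1) shrinks the output to a single part file (optional; assumes
# the data is small enough to fit comfortably in one ~60 MB file).
df.coalesce(1) \
  .write \
  .mode("overwrite") \
  .parquet("s3n://bucket/myTest.parquet")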