I'm reading a large number of files from an S3 bucket.
After reading those files, I want to run a filter operation on the resulting dataframe.
But when the filter operation executes, the data is downloaded from the S3 bucket again. How can I avoid reloading the dataframe?
I have tried caching and/or persisting the dataframe before the filter operation, but Spark still somehow pulls the data from the S3 bucket again.
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

var df = spark.read.json("path_to_s3_bucket/*.json")
df.persist(StorageLevel.MEMORY_AND_DISK_SER_2)
df = df.filter("filter condition").sort(col("columnName").asc)
If the dataframe is cached, it should not be reloaded from S3 again.
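For what it's worth, my understanding is that persist() is lazy and the cache is only populated by the first action. Below is a minimal sketch of the pattern I expected to work; the bucket path, column name, and filter value are placeholders, not my real values.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()

// Placeholder bucket path for illustration only.
val raw = spark.read.json("s3a://my-bucket/data/*.json")
raw.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

// persist() only marks the dataframe for caching; this action is what
// actually scans S3 and fills the cache.
raw.count()

// This filter + sort should now be served from the cached data, not S3.
val filtered = raw.filter(col("status") === "active").sort(col("status").asc)
filtered.show()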
Can you explain? How are you sure this is reading from the bucket again?
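One way to check, assuming the snippet above: print the physical plan after marking the dataframe for caching. A plan that reads from the cache shows InMemoryTableScan / InMemoryRelation, while a plan that goes back to the bucket shows a FileScan over the JSON files; the Storage tab of the Spark UI also shows whether the cache is actually populated.

// Prints the physical plan for the df from the question;
// look for InMemoryTableScan vs. FileScan json.
df.explain()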