I'm unable to save a PySpark DataFrame to an S3 bucket.
I'm running the code inside a Docker dev container.
My AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set up in the environment.
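A quick sanity check along these lines (not part of my actual job, just a way to confirm the variables are visible to the Python process inside the container):

```python
import os

# Sanity check that the standard AWS credential variables are visible
# to the Python process driving Spark inside the dev container.
for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    print(var, "is set" if os.environ.get(var) else "is MISSING")
```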
Env Setup
Base image: gcr.io/datamechanics/spark:platform-3.2.1-hadoop-3.3.1-java-11-scala-2.12-python-3.8-dm18
I have the following jars available in `/opt/spark/jars`: `aws-java-sdk-bundle-1.11.901.jar`, `aws-java-sdk-core-1.11.797.jar`, `aws-java-sdk-glue-1.11.797.jar`, `hadoop-aws-3.3.1.jar`.
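To double-check what is actually on the classpath, a small listing like this (using the `/opt/spark/jars` path from the image above) prints the relevant jar names and versions:

```python
import glob
import os

# List the AWS SDK and hadoop-aws jars shipped in the base image,
# to confirm which versions end up on the driver classpath.
jars_dir = "/opt/spark/jars"
for jar in sorted(glob.glob(os.path.join(jars_dir, "*aws*.jar"))):
    print(os.path.basename(jar))
```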
Sample code
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.maxExecutors", "4") \
    .config("spark.dynamicAllocation.minExecutors", "1") \
    .config("spark.dynamicAllocation.initialExecutors", "1") \
    .config("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED") \
    .config("spark.sql.legacy.pathOptionBehavior.enabled", "true") \
    .config("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED") \
    .getOrCreate()

source_file = "/workspaces/sample/test/*"
df = spark.read.parquet(source_file)
df.write.format("parquet").mode("append").save("s3a://MY_BUCKET/MY_FOLDER/")
```
ERROR:

```
java.io.IOException: regular upload failed: java.lang.NoSuchMethodError: 'void com.amazonaws.util.IOUtils.release(java.io.Closeable, com.amazonaws.thirdparty.apache.logging.Log)'
```
I checked multiple blogs, and developers mostly attribute this error to a version mismatch between the AWS SDK and hadoop-aws jars. The versions look fine to me, though: the same code with the same setup works when I run it in the AWS environment, but when I run it locally I get the error above.
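To rule the mismatch in or out, I believe the classpath can be inspected at runtime through the existing `spark` session; a sketch (nothing assumed beyond py4j reflection and the class name from the stack trace):

```python
# Which jar actually provides the AWS SDK class named in the stack trace,
# and which Hadoop version Spark is linked against.
jvm = spark.sparkContext._jvm
io_utils = jvm.java.lang.Class.forName("com.amazonaws.util.IOUtils")
print("IOUtils loaded from:", io_utils.getProtectionDomain().getCodeSource().getLocation())
print("Hadoop version:", jvm.org.apache.hadoop.util.VersionInfo.getVersion())
```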