I'm unable to save a PySpark DataFrame to an S3 bucket.

I'm running the code inside a Docker dev container.

My AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set in the environment.

Env Setup

Base image: gcr.io/datamechanics/spark:platform-3.2.1-hadoop-3.3.1-java-11-scala-2.12-python-3.8-dm18

I have the following jars available in /opt/spark/jars: 'aws-java-sdk-bundle-1.11.901.jar', 'aws-java-sdk-core-1.11.797.jar', 'aws-java-sdk-glue-1.11.797.jar', 'hadoop-aws-3.3.1.jar'

Sample code

`from pyspark.sql import SparkSession
spark = SparkSession.builder \
            .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
            .config("spark.dynamicAllocation.enabled", "true") \
            .config("spark.dynamicAllocation.maxExecutors", "4") \
            .config("spark.dynamicAllocation.minExecutors", "1") \
            .config("spark.dynamicAllocation.initialExecutors", "1") \
            .config("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED") \
            .config("spark.sql.legacy.pathOptionBehavior.enabled", "true") \
            .config("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED") \
            .getOrCreate()

source_file = "/workspaces/sample/test/*"
df = spark.read.parquet(source_file)
df.write.format("parquet").mode("append").save("s3a://MY_BUCKET/MY_FOLDER/")`

ERROR: java.io.IOException: regular upload failed: java.lang.NoSuchMethodError: 'void com.amazonaws.util.IOUtils.release(java.io.Closeable, com.amazonaws.thirdparty.apache.logging.Log)'

I checked multiple blogs, and developers mostly attribute this error to a version mismatch. The versions look fine to me, because the same code with the same setup works when I run it in an AWS environment. I only get the error above when I run it locally.

1 Answer

You should only have the aws-java-sdk-bundle jar on the classpath. The other two AWS SDK jars (aws-java-sdk-core and aws-java-sdk-glue, both 1.11.797) are from a different release than the bundle (1.11.901) and will only "give you stack traces", as the Hadoop S3A docs cover in some detail. The bundle jar contains these libraries and shaded versions of all their dependencies.
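
To confirm which jars are competing, a small check like the one below can help. It is only a sketch: the helper name `jars_providing` is made up for illustration, and it assumes the class named in the stack trace (`com.amazonaws.util.IOUtils`) is stored at the usual path inside each jar. It opens every jar in a directory as a zip archive and reports the ones that contain the class.

```python
import zipfile
from pathlib import Path
from typing import List


def jars_providing(class_name: str, jars_dir: str) -> List[str]:
    """Return the names of the jars in jars_dir that contain class_name.

    Hypothetical helper for illustration; not part of Spark or Hadoop.
    """
    # A class a.b.C is stored inside a jar as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    providers = []
    for jar in sorted(Path(jars_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    providers.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are not valid zip archives
            continue
    return providers


if __name__ == "__main__":
    # /opt/spark/jars is the directory mentioned in the question
    for name in jars_providing("com.amazonaws.util.IOUtils", "/opt/spark/jars"):
        print(name)
```

If more than one jar is listed, the JVM may load the class from any of them. Removing the non-bundle AWS SDK jars, as suggested above, should leave a single provider.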
