I'm facing an authentication issue when using MinIO with PySpark, Iceberg, and the Nessie catalog. I can access the catalog, list databases, and see the tables, but I cannot query the tables. Here's my setup:
version: '3.8'

x-common-variables: &aws_env
  AWS_ACCESS_KEY_ID: minioadmin
  AWS_SECRET_ACCESS_KEY: minioadmin
  AWS_REGION: us-west-1
  AWS_DEFAULT_REGION: us-west-1

services:
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
      - MINIO_DOMAIN=minio
      - MINIO_REGION=us-west-1
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
    volumes:
      - minio-data:/data

  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    environment:
      - AWS_ACCESS_KEY_ID=minioadmin
      - AWS_SECRET_ACCESS_KEY=minioadmin
      - AWS_REGION=us-west-1
      - MINIO_USER=minioadmin
      - MINIO_PASSWORD=minioadmin
      - MINIO_DOMAIN=minio
      - MINIO_REGION=us-west-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 minioadmin minioadmin) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc mb minio/sensors-lakehouse;
      /usr/bin/mc policy set public minio/sensors-lakehouse;
      tail -f /dev/null
      "

  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: nessie
      POSTGRES_PASSWORD: nessie
      POSTGRES_DB: nessie
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data

  nessie:
    image: projectnessie/nessie
    ports:
      - "19120:19120"
    environment:
      QUARKUS_HTTP_PORT: 19120
      NESSIE_VERSION_STORE_TYPE: JDBC
      QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://postgres:5432/nessie
      QUARKUS_DATASOURCE_USERNAME: nessie
      QUARKUS_DATASOURCE_PASSWORD: nessie
      QUARKUS_OIDC_ENABLED: "false"
      <<: *aws_env
    depends_on:
      - postgres

  pyspark:
    build:
      context: ./pyspark
    volumes:
      - ./pyspark/scripts:/pyspark/scripts
    depends_on:
      - minio
      - mc
      - nessie
    environment:
      AWS_S3_ENDPOINT: "http://minio:9000"
      <<: *aws_env

volumes:
  postgres-data:
    driver: local
  minio-data:
    driver: local
The Dockerfile I build for the pyspark service is:
FROM bitnami/spark:3.5
ADD https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.2/iceberg-spark-runtime-3.5_2.12-1.5.2.jar /opt/bitnami/spark/jars
ADD https://repo.maven.apache.org/maven2/org/projectnessie/nessie-integrations/nessie-spark-extensions-3.5_2.12/0.99.0/nessie-spark-extensions-3.5_2.12-0.99.0.jar /opt/bitnami/spark/jars
ADD https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.5.2/iceberg-aws-bundle-1.5.2.jar /opt/bitnami/spark/jars
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.773/aws-java-sdk-bundle-1.12.773.jar /opt/bitnami/spark/jars
ADD https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.28.16/url-connection-client-2.28.16.jar /opt/bitnami/spark/jars
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.6/hadoop-aws-3.3.6.jar /opt/bitnami/spark/jars
USER root
RUN pip3 install py4j
The script I'm running:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName("test") \
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions") \
.config("spark.sql.catalog.sensors_catalog", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.sensors_catalog.warehouse", "s3a://sensors-lakehouse") \
.config("spark.sql.warehouse.dir", "s3a://sensors-lakehouse") \
.config("spark.sql.catalog.sensors_catalog.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog") \
.config("spark.sql.catalog.sensors_catalog.uri", "http://nessie:19120/api/v1") \
.config("spark.sql.catalog.sensors_catalog.ref", "main") \
.config("spark.sql.catalog.sensors_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
.getOrCreate()
spark._jsc.hadoopConfiguration().unset("fs.s3a.aws.credentials.provider")
# Explicitly set the credentials in Hadoop configuration
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "minioadmin")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "minioadmin")
spark._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "false")
spark._jsc.hadoopConfiguration().set("fs.s3a.proxy.host", "minio")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "minio")
spark._jsc.hadoopConfiguration().set("fs.s3a.proxy.port", "9000")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint.region", "us-west-1")
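For reference, the same fs.s3a settings can also be supplied when the session is built, via `spark.hadoop.*` keys, so they are in place before the first S3A filesystem instance is created (mutating the Hadoop configuration after `getOrCreate()` can be too late if a filesystem was already instantiated). This sketch uses the same values as above; writing the endpoint as a full URL instead of the proxy-host/proxy-port pair is my own variation:

```python
from pyspark.sql import SparkSession

# Same S3A credentials/endpoint as above, but passed as spark.hadoop.* keys
# at build time so they exist before any S3A client is constructed.
spark = SparkSession.builder \
    .appName("test") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
    .getOrCreate()
```

Note this only configures the Hadoop S3A layer (`s3a://` paths); whether Iceberg's own S3 I/O layer reads these settings at all is part of my question.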
Using this setup, I can successfully access the catalog, list databases, and view tables. However, when I try to query the tables, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o41.sql.
: software.amazon.awssdk.services.s3.model.S3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: S3, Status Code: 403, Request ID: MYBC49K73C1T64ZW, Extended Request ID: H/WT1VriLv0ppiW7fPMFIPRS0QERqMd0X4ooLYpnqrGrJ3DyA05NREqVIVC7u4VXpw9iQetwQlA=)
It seems like the Hadoop fs.s3a settings in my script are being ignored for the actual table reads. If I remove the credentials section from the script, the S3 client falls back to the AWS_* environment variables; if those are also missing, it fails with an error saying no credentials could be found.
What I've tried:
- Explicitly setting the credentials in the Spark script, as shown above.
- Using minioadmin as both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, both as environment variables and directly in the Spark script.
- Setting the MinIO bucket to public access.
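One thing I'm starting to suspect: the catalog is configured with `io-impl` = `org.apache.iceberg.aws.s3.S3FileIO`, which is backed by the AWS SDK v2 and, as far as I understand from the Iceberg AWS documentation, reads its endpoint and credentials from Iceberg catalog properties (`s3.*` / `client.*` keys), not from the Hadoop fs.s3a configuration. If that's right, a configuration along these lines would be the intended way to point it at MinIO (property names from the Iceberg docs; I have not yet confirmed this fixes my setup):

```python
from pyspark.sql import SparkSession

# Hypothesis: pass the MinIO endpoint and credentials as Iceberg catalog
# properties, which is where S3FileIO (AWS SDK v2) looks for them.
spark = SparkSession.builder \
    .appName("test") \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions") \
    .config("spark.sql.catalog.sensors_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.sensors_catalog.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog") \
    .config("spark.sql.catalog.sensors_catalog.uri", "http://nessie:19120/api/v1") \
    .config("spark.sql.catalog.sensors_catalog.ref", "main") \
    .config("spark.sql.catalog.sensors_catalog.warehouse", "s3a://sensors-lakehouse") \
    .config("spark.sql.catalog.sensors_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.sensors_catalog.s3.endpoint", "http://minio:9000") \
    .config("spark.sql.catalog.sensors_catalog.s3.path-style-access", "true") \
    .config("spark.sql.catalog.sensors_catalog.s3.access-key-id", "minioadmin") \
    .config("spark.sql.catalog.sensors_catalog.s3.secret-access-key", "minioadmin") \
    .config("spark.sql.catalog.sensors_catalog.client.region", "us-west-1") \
    .getOrCreate()
```

Is this the right direction, or is something else swallowing the credentials?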
Question: Why is my Spark configuration for S3 credentials being ignored, and how can I resolve this S3Exception to query my tables?
Any help would be greatly appreciated. Thanks!