
I'm facing an authentication issue when using MinIO with PySpark, Iceberg, and Nessie catalog. I can access the catalog, databases, and tables, but I cannot query the tables. Here's my setup:

version: '3.8'

x-common-variables: &aws_env
  AWS_ACCESS_KEY_ID: minioadmin
  AWS_SECRET_ACCESS_KEY: minioadmin
  AWS_REGION: us-west-1
  AWS_DEFAULT_REGION: us-west-1

services:
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
      - MINIO_DOMAIN=minio
      - MINIO_REGION=us-west-1
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
    volumes:
      - minio-data:/data 
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    environment:
      - AWS_ACCESS_KEY_ID=minioadmin
      - AWS_SECRET_ACCESS_KEY=minioadmin
      - AWS_REGION=us-west-1
      - MINIO_USER=minioadmin
      - MINIO_PASSWORD=minioadmin
      - MINIO_DOMAIN=minio
      - MINIO_REGION=us-west-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 minioadmin minioadmin) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc mb minio/sensors-lakehouse;
      /usr/bin/mc policy set public minio/sensors-lakehouse;
      tail -f /dev/null
      "      
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: nessie
      POSTGRES_PASSWORD: nessie
      POSTGRES_DB: nessie
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data
  nessie:
    image: projectnessie/nessie
    ports:
      - "19120:19120"
    environment:
      QUARKUS_HTTP_PORT: 19120
      NESSIE_VERSION_STORE_TYPE: JDBC
      QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://postgres:5432/nessie
      QUARKUS_DATASOURCE_USERNAME: nessie
      QUARKUS_DATASOURCE_PASSWORD: nessie
      QUARKUS_OIDC_ENABLED: "false"
      <<: *aws_env  
    depends_on:
      - postgres

  pyspark:
    build:
      context: ./pyspark
    volumes:
      - ./pyspark/scripts:/pyspark/scripts
    depends_on:
      - minio
      - mc
      - nessie
    environment:
       AWS_S3_ENDPOINT: "http://minio:9000"
       <<: *aws_env  
    


volumes:
  postgres-data:
    driver: local
  minio-data:   
    driver: local

The Dockerfile I build for PySpark is:

FROM bitnami/spark:3.5

ADD https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.2/iceberg-spark-runtime-3.5_2.12-1.5.2.jar /opt/bitnami/spark/jars
ADD https://repo.maven.apache.org/maven2/org/projectnessie/nessie-integrations/nessie-spark-extensions-3.5_2.12/0.99.0/nessie-spark-extensions-3.5_2.12-0.99.0.jar /opt/bitnami/spark/jars
ADD https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.5.2/iceberg-aws-bundle-1.5.2.jar /opt/bitnami/spark/jars
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.773/aws-java-sdk-bundle-1.12.773.jar /opt/bitnami/spark/jars
ADD https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.28.16/url-connection-client-2.28.16.jar /opt/bitnami/spark/jars
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.6/hadoop-aws-3.3.6.jar /opt/bitnami/spark/jars

USER root

RUN pip3 install py4j

The script I'm running:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("test") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions") \
    .config("spark.sql.catalog.sensors_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.sensors_catalog.warehouse", "s3a://sensors-lakehouse") \
    .config("spark.sql.warehouse.dir", "s3a://sensors-lakehouse") \
    .config("spark.sql.catalog.sensors_catalog.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog") \
    .config("spark.sql.catalog.sensors_catalog.uri", "http://nessie:19120/api/v1") \
    .config("spark.sql.catalog.sensors_catalog.ref", "main") \
    .config("spark.sql.catalog.sensors_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()
spark._jsc.hadoopConfiguration().unset("fs.s3a.aws.credentials.provider")
# Explicitly set the credentials in Hadoop configuration
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "minioadmin")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "minioadmin")
spark._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "false")
spark._jsc.hadoopConfiguration().set("fs.s3a.proxy.host", "minio")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "minio")
spark._jsc.hadoopConfiguration().set("fs.s3a.proxy.port", "9000")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint.region", "us-west-1")

Using this setup, I can successfully access the catalog, list databases, and view tables. However, when I try to query the tables, I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o41.sql.
: software.amazon.awssdk.services.s3.model.S3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: S3, Status Code: 403, Request ID: MYBC49K73C1T64ZW, Extended Request ID: H/WT1VriLv0ppiW7fPMFIPRS0QERqMd0X4ooLYpnqrGrJ3DyA05NREqVIVC7u4VXpw9iQetwQlA=)

It seems like the S3 configuration in my Spark script is being ignored. If I remove the credentials section from the script, it falls back to reading the AWS_* environment variables, and if those are missing it throws an error saying that no credentials were found.
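One thing worth noting: Iceberg's S3FileIO is built on the AWS SDK v2 and takes its endpoint and credentials from catalog-level properties, not from the Hadoop fs.s3a.* settings, which would explain why those settings appear to be ignored. A minimal sketch of passing everything as catalog properties instead (same catalog name, bucket, and minioadmin credentials as above; the s3.* and client.region keys are Iceberg's documented AWS properties):

from pyspark.sql import SparkSession

# Sketch: hand the MinIO endpoint and credentials to S3FileIO via Iceberg
# catalog properties instead of fs.s3a.* Hadoop settings, which S3FileIO
# does not read.
spark = (
    SparkSession.builder
    .appName("test")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    .config("spark.sql.catalog.sensors_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.sensors_catalog.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.sensors_catalog.uri", "http://nessie:19120/api/v1")
    .config("spark.sql.catalog.sensors_catalog.ref", "main")
    .config("spark.sql.catalog.sensors_catalog.warehouse", "s3a://sensors-lakehouse")
    .config("spark.sql.catalog.sensors_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    # S3FileIO properties (AWS SDK v2), not fs.s3a.*:
    .config("spark.sql.catalog.sensors_catalog.s3.endpoint", "http://minio:9000")
    .config("spark.sql.catalog.sensors_catalog.s3.path-style-access", "true")
    .config("spark.sql.catalog.sensors_catalog.s3.access-key-id", "minioadmin")
    .config("spark.sql.catalog.sensors_catalog.s3.secret-access-key", "minioadmin")
    .config("spark.sql.catalog.sensors_catalog.client.region", "us-west-1")
    .getOrCreate()
)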

What I've tried:
- Explicitly setting the credentials in the Spark script, as shown above.
- Using minioadmin for both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, both in the environment variables and directly in the Spark script.
- Setting the MinIO bucket to public access.

Question: Why is my Spark configuration for S3 credentials being ignored, and how can I resolve this S3Exception to query my tables?

Any help would be greatly appreciated. Thanks!

  • start with the docs hadoop.apache.org/docs/current//hadoop-aws/tools/hadoop-aws/… Commented Oct 7, 2024 at 16:33
  • I get the same errors; from the debug logs it seems that the endpoint has not been overridden, and it still tries to authenticate against AWS every time... Commented Oct 12, 2024 at 7:45
  • hmm. pull out all those fs.s3a settings into a hadoop XML file and use cloudstore to review and interpret them in its storediag command. Its aim in life is to make support calls go away github.com/steveloughran/cloudstore Commented Oct 16, 2024 at 13:53

1 Answer


Try adding nessie.catalog.service.s3.default-options.path-style-access=true to the Nessie configuration; path-style access helps resolve MinIO URLs (bucket-prefixed hostnames such as sensors-lakehouse.minio won't resolve otherwise).

e.g.

  nessie:
    image: ghcr.io/projectnessie/nessie:0.99.0
    container_name: nessie
    ports:
      - "19120:19120"
    environment:
      - nessie.version.store.type=IN_MEMORY
      - nessie.catalog.default-warehouse=warehouse
      - nessie.catalog.warehouses.warehouse.location=s3://demos-bucket/
      - nessie.catalog.service.s3.default-options.endpoint=http://minio:9000/
      - nessie.catalog.service.s3.default-options.path-style-access=true
      - nessie.catalog.service.s3.default-options.access-key=urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key
      - nessie.catalog.secrets.access-key.name=admin
      - nessie.catalog.secrets.access-key.secret=password
      - nessie.catalog.service.s3.default-options.region=us-east-1
      - nessie.server.authentication.enabled=false
    depends_on:
      - minio
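With a recent Nessie image configured this way, the server can also act as an Iceberg REST catalog and hand the S3 endpoint, path-style setting, and credentials to clients itself, so the Spark side would not need any fs.s3a.* or s3.* settings at all. A minimal sketch of the matching client configuration, assuming Nessie's Iceberg REST endpoint is exposed at /iceberg/main (the catalog and table names here are illustrative):

from pyspark.sql import SparkSession

# Sketch: point Spark at Nessie's Iceberg REST endpoint so the S3 options
# configured on the Nessie server (see the compose snippet above) are used.
spark = (
    SparkSession.builder
    .appName("test")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.sensors_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.sensors_catalog.type", "rest")
    # Assumed Nessie Iceberg REST URI; "main" is the branch to work against.
    .config("spark.sql.catalog.sensors_catalog.uri", "http://nessie:19120/iceberg/main")
    .getOrCreate()
)

# Hypothetical namespace and table, just to show a query against the catalog.
spark.sql("SELECT * FROM sensors_catalog.db.sensors LIMIT 10").show()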