I'm facing an authentication issue when using MinIO with PySpark, Iceberg, and the Nessie catalog. I can access the catalog, list databases, and see the tables, but I cannot query the tables. Here's my setup:
version: '3.8'

x-common-variables: &aws_env
  AWS_ACCESS_KEY_ID: minioadmin
  AWS_SECRET_ACCESS_KEY: minioadmin
  AWS_REGION: us-west-1
  AWS_DEFAULT_REGION: us-west-1

services:
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
      - MINIO_DOMAIN=minio
      - MINIO_REGION=us-west-1
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
    volumes:
      - minio-data:/data

  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    environment:
      - AWS_ACCESS_KEY_ID=minioadmin
      - AWS_SECRET_ACCESS_KEY=minioadmin
      - AWS_REGION=us-west-1
      - MINIO_USER=minioadmin
      - MINIO_PASSWORD=minioadmin
      - MINIO_DOMAIN=minio
      - MINIO_REGION=us-west-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 minioadmin minioadmin) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc mb minio/sensors-lakehouse;
      /usr/bin/mc policy set public minio/sensors-lakehouse;
      tail -f /dev/null
      "

  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: nessie
      POSTGRES_PASSWORD: nessie
      POSTGRES_DB: nessie
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data

  nessie:
    image: projectnessie/nessie
    ports:
      - "19120:19120"
    environment:
      QUARKUS_HTTP_PORT: 19120
      NESSIE_VERSION_STORE_TYPE: JDBC
      QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://postgres:5432/nessie
      QUARKUS_DATASOURCE_USERNAME: nessie
      QUARKUS_DATASOURCE_PASSWORD: nessie
      QUARKUS_OIDC_ENABLED: "false"
      <<: *aws_env
    depends_on:
      - postgres

  pyspark:
    build:
      context: ./pyspark
    volumes:
      - ./pyspark/scripts:/pyspark/scripts
    depends_on:
      - minio
      - mc
      - nessie
    environment:
      AWS_S3_ENDPOINT: "http://minio:9000"
      <<: *aws_env

volumes:
  postgres-data:
    driver: local
  minio-data:
    driver: local
The Dockerfile I build for the pyspark service is:
FROM bitnami/spark:3.5
ADD https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.2/iceberg-spark-runtime-3.5_2.12-1.5.2.jar /opt/bitnami/spark/jars
ADD https://repo.maven.apache.org/maven2/org/projectnessie/nessie-integrations/nessie-spark-extensions-3.5_2.12/0.99.0/nessie-spark-extensions-3.5_2.12-0.99.0.jar /opt/bitnami/spark/jars
ADD https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.5.2/iceberg-aws-bundle-1.5.2.jar /opt/bitnami/spark/jars
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.773/aws-java-sdk-bundle-1.12.773.jar /opt/bitnami/spark/jars
ADD https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.28.16/url-connection-client-2.28.16.jar /opt/bitnami/spark/jars
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.6/hadoop-aws-3.3.6.jar /opt/bitnami/spark/jars
USER root
RUN pip3 install py4j
The script I'm running:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName("test") \
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions") \
.config("spark.sql.catalog.sensors_catalog", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.sensors_catalog.warehouse", "s3a://sensors-lakehouse") \
.config("spark.sql.warehouse.dir", "s3a://sensors-lakehouse") \
.config("spark.sql.catalog.sensors_catalog.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog") \
.config("spark.sql.catalog.sensors_catalog.uri", "http://nessie:19120/api/v1") \
.config("spark.sql.catalog.sensors_catalog.ref", "main") \
.config("spark.sql.catalog.sensors_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
.getOrCreate()
spark._jsc.hadoopConfiguration().unset("fs.s3a.aws.credentials.provider")
# Explicitly set the credentials in Hadoop configuration
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "minioadmin")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "minioadmin")
spark._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "false")
spark._jsc.hadoopConfiguration().set("fs.s3a.proxy.host", "minio")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "minio")
spark._jsc.hadoopConfiguration().set("fs.s3a.proxy.port", "9000")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint.region", "us-west-1")
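For reference, the same fs.s3a settings can also be supplied when the session is built, via `spark.hadoop.*` keys, so they are in place before the first S3A filesystem instance is created (mutating the Hadoop configuration after `getOrCreate()` can be too late if a filesystem was already instantiated). This sketch uses the same values as above; writing the endpoint as a full URL instead of the proxy-host/proxy-port pair is my own variation:

```python
from pyspark.sql import SparkSession

# Same S3A credentials/endpoint as above, but passed as spark.hadoop.* keys
# at build time so they exist before any S3A client is constructed.
spark = SparkSession.builder \
    .appName("test") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
    .getOrCreate()
```

Note this only configures the Hadoop S3A layer (`s3a://` paths); whether Iceberg's own S3 I/O layer reads these settings at all is part of my question.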
Using this setup, I can successfully access the catalog, list databases, and view tables. However, when I try to query the tables, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o41.sql.
: software.amazon.awssdk.services.s3.model.S3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: S3, Status Code: 403, Request ID: MYBC49K73C1T64ZW, Extended Request ID: H/WT1VriLv0ppiW7fPMFIPRS0QERqMd0X4ooLYpnqrGrJ3DyA05NREqVIVC7u4VXpw9iQetwQlA=)
It seems like the Hadoop fs.s3a settings in my script are being ignored for the actual table reads. If I remove the credentials section from the script, the S3 client falls back to the AWS_* environment variables; if those are also missing, it fails with an error saying no credentials could be found.
What I've tried:
- Explicitly setting the credentials in the Spark script, as shown above.
- Using minioadmin as both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, both as environment variables and directly in the Spark script.
- Setting the MinIO bucket to public access.
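One thing I'm starting to suspect: the catalog is configured with `io-impl` = `org.apache.iceberg.aws.s3.S3FileIO`, which is backed by the AWS SDK v2 and, as far as I understand from the Iceberg AWS documentation, reads its endpoint and credentials from Iceberg catalog properties (`s3.*` / `client.*` keys), not from the Hadoop fs.s3a configuration. If that's right, a configuration along these lines would be the intended way to point it at MinIO (property names from the Iceberg docs; I have not yet confirmed this fixes my setup):

```python
from pyspark.sql import SparkSession

# Hypothesis: pass the MinIO endpoint and credentials as Iceberg catalog
# properties, which is where S3FileIO (AWS SDK v2) looks for them.
spark = SparkSession.builder \
    .appName("test") \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions") \
    .config("spark.sql.catalog.sensors_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.sensors_catalog.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog") \
    .config("spark.sql.catalog.sensors_catalog.uri", "http://nessie:19120/api/v1") \
    .config("spark.sql.catalog.sensors_catalog.ref", "main") \
    .config("spark.sql.catalog.sensors_catalog.warehouse", "s3a://sensors-lakehouse") \
    .config("spark.sql.catalog.sensors_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.sensors_catalog.s3.endpoint", "http://minio:9000") \
    .config("spark.sql.catalog.sensors_catalog.s3.path-style-access", "true") \
    .config("spark.sql.catalog.sensors_catalog.s3.access-key-id", "minioadmin") \
    .config("spark.sql.catalog.sensors_catalog.s3.secret-access-key", "minioadmin") \
    .config("spark.sql.catalog.sensors_catalog.client.region", "us-west-1") \
    .getOrCreate()
```

Is this the right direction, or is something else swallowing the credentials?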
Question: Why is my Spark configuration for S3 credentials being ignored, and how can I resolve this S3Exception to query my tables?
Any help would be greatly appreciated. Thanks!