
Is there any way to run Spark SQL queries with a local master against the AWS Glue Data Catalog?

I launch this code on my local PC:

import org.apache.spark.sql.SparkSession;

SparkSession.builder()
    .master("local")
    .enableHiveSupport()
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .getOrCreate()
    .sql("show databases"); // this query isn't running against AWS Glue

EDIT: based on some examples, it appears that the hive.metastore.uris configuration key should allow specifying a metastore URL; however, it's not clear how to obtain the relevant value for Glue.

SparkSession.builder()
    .master("local")
    .enableHiveSupport()
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .config("hive.metastore.uris", "thrift://???:9083")
    .getOrCreate()
    .sql("show databases"); // this query isn't running against AWS Glue
  • I think it isn't possible, for two reasons: 1) You can run Glue code using the UI, boto3, or dev endpoints, and you can also use the AWS Glue Data Catalog in AWS EMR, but to my knowledge those are all the options. 2) The Glue service is based on technologies such as Hive and Spark, but it isn't a pure version of them; there are limitations, and the service uses its own library. Commented Sep 15, 2018 at 15:21
  • @j.b.gorski It looks like Glue serves only as a metadata store and doesn't transform data. So instead of mocking data for integration tests I can replace the Glue reader with an S3 reader and read the data directly from S3 (enforcing the same schema). The only error-prone point here is enforcing the schema on a CSV dataset read from S3. Commented Sep 15, 2018 at 22:09
  • @j.b.gorski What's strange: session.catalog().listDatabases() returns the default database with Glue's description, and Spark SQL also returns default when I run show databases, but it does not see the other Glue databases. Commented Sep 15, 2018 at 22:10
  • Did you manage to find a solution? Commented Sep 15, 2019 at 14:07
  • Did you find a way to do this? Commented Jan 13, 2022 at 7:43

2 Answers


Amazon provides this client, which should solve the problem (I haven't tried it yet):

https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore
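
Once the client JARs from that repo are on Spark's classpath, a quick sanity check is to list the catalog's databases, which should now include the Glue ones rather than only default. A sketch; my_glue_db is a hypothetical name, so substitute a database that actually exists in your Glue Data Catalog:

import org.apache.spark.sql.SparkSession;

// Sanity check, assuming the Glue catalog client JARs are on the classpath.
SparkSession spark = SparkSession.builder()
    .master("local")
    .enableHiveSupport()
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .getOrCreate();

spark.catalog().listDatabases().show(); // should include your Glue databases

// "my_glue_db" is hypothetical -- replace it with one of your Glue databases.
spark.sql("show tables in my_glue_db").show();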



To run Spark locally against remote AWS S3 data using the AWS Glue metadata store, please follow the release notes I've created for the community after successfully building and testing the patched Hive libraries and the relevant AWS Glue classes (as suggested by Ophir in the previous post):

https://github.com/jirislav/aws-glue-data-catalog-client-for-apache-hive-metastore/releases/tag/spark-3.3.0

I've shared the pre-built JARs so you don't have to build them yourself, but you can; in fact, I encourage you to do so.


TL;DR

Download

Make sure the sha512sum command succeeds.

cd /tmp
wget https://github.com/jirislav/aws-glue-data-catalog-client-for-apache-hive-metastore/releases/download/spark-3.3.0/spark-3.3.0-jars.tgz
sha512sum -c <(curl -sL https://github.com/jirislav/aws-glue-data-catalog-client-for-apache-hive-metastore/releases/download/spark-3.3.0/spark-3.3.0-jars.tgz.sha512)

Extract

cd "$SPARK_HOME/jars"
tar -xf /tmp/spark-3.3.0-jars.tgz

And finally adjust SPARK_CONF_DIR as discussed in the release notes.
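
If you prefer, the key settings can also be supplied programmatically instead of via SPARK_CONF_DIR. This is only a sketch, with the release notes remaining the authoritative source for the exact settings, and it assumes the patched JARs extracted above are already in $SPARK_HOME/jars:

import org.apache.spark.sql.SparkSession;

// Programmatic equivalent of the spark-defaults.conf approach (a sketch;
// defer to the release notes for the authoritative key/value pairs).
SparkSession spark = SparkSession.builder()
    .master("local")
    // same effect as enableHiveSupport(): use the Hive-aware catalog
    .config("spark.sql.catalogImplementation", "hive")
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .getOrCreate();

spark.sql("show databases").show();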

