
Is there any way to run Spark SQL queries with a local master against the AWS Glue Data Catalog?

I launch this code on my local PC:

import org.apache.spark.sql.SparkSession;

SparkSession.builder()
    .master("local")
    .enableHiveSupport()
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .getOrCreate()
    .sql("show databases"); // this query isn't running against AWS Glue

EDIT: based on some examples, it appears that the hive.metastore.uris configuration key should allow specifying a metastore URL; however, it's not clear how to obtain the relevant value for Glue.

SparkSession.builder()
    .master("local")
    .enableHiveSupport()
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .config("hive.metastore.uris", "thrift://???:9083")
    .getOrCreate()
    .sql("show databases"); // this query isn't running against AWS Glue
  • I think it isn't possible, for two reasons: 1) You can run Glue code using the UI, boto3, or dev endpoints, and you can also use the AWS Glue Data Catalog in AWS EMR, but to my knowledge those are all the options. 2) The Glue service is based on technologies such as Hive and Spark, but it isn't a pure version of them; there are limitations, and the service uses its own library. Commented Sep 15, 2018 at 15:21
  • @j.b.gorski It looks like Glue serves only as a metadata store and doesn't transform data. So instead of mocking data for integration tests I can replace the Glue reader with an S3 reader and read the data directly from S3 (enforcing the same schema). The only error-prone point here is enforcing the schema on a CSV dataset read from S3. Commented Sep 15, 2018 at 22:09
  • @j.b.gorski What's strange: session.catalog().listDatabases() returns the default database with Glue's description, and Spark SQL also returns default when I run show databases, but it does not see the other Glue databases. Commented Sep 15, 2018 at 22:10
  • Did you manage to find a solution? Commented Sep 15, 2019 at 14:07
  • Did you find a way to do this? Commented Jan 13, 2022 at 7:43

2 Answers


Amazon provides this client, which should solve the problem (I haven't tried it yet):

https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore
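
Once the client JARs from that repo are on Spark's classpath, a quick sanity check is to list the catalog's databases, which should now include the Glue ones rather than only default. A sketch; my_glue_db is a hypothetical name, so substitute a database that actually exists in your Glue Data Catalog:

import org.apache.spark.sql.SparkSession;

// Sanity check, assuming the Glue catalog client JARs are on the classpath.
SparkSession spark = SparkSession.builder()
    .master("local")
    .enableHiveSupport()
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .getOrCreate();

spark.catalog().listDatabases().show(); // should include your Glue databases

// "my_glue_db" is hypothetical -- replace it with one of your Glue databases.
spark.sql("show tables in my_glue_db").show();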



To run Spark locally against remote AWS S3 data using the AWS Glue metadata store, please follow the release notes I've created for the community after successfully building and testing the patched Hive libraries and the relevant AWS Glue classes (as suggested by Ophir in the previous post):

https://github.com/jirislav/aws-glue-data-catalog-client-for-apache-hive-metastore/releases/tag/spark-3.3.0

I've shared the pre-built JARs so you don't have to build them yourself, but you can; in fact, I encourage you to do so.


TL;DR

Download

Make sure the sha512sum command succeeds.

cd /tmp
wget https://github.com/jirislav/aws-glue-data-catalog-client-for-apache-hive-metastore/releases/download/spark-3.3.0/spark-3.3.0-jars.tgz
sha512sum -c <(curl -sL https://github.com/jirislav/aws-glue-data-catalog-client-for-apache-hive-metastore/releases/download/spark-3.3.0/spark-3.3.0-jars.tgz.sha512)

Extract

cd "$SPARK_HOME/jars"
tar -xf /tmp/spark-3.3.0-jars.tgz

And finally adjust SPARK_CONF_DIR as discussed in the release notes.
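
If you prefer, the key settings can also be supplied programmatically instead of via SPARK_CONF_DIR. This is only a sketch, with the release notes remaining the authoritative source for the exact settings, and it assumes the patched JARs extracted above are already in $SPARK_HOME/jars:

import org.apache.spark.sql.SparkSession;

// Programmatic equivalent of the spark-defaults.conf approach (a sketch;
// defer to the release notes for the authoritative key/value pairs).
SparkSession spark = SparkSession.builder()
    .master("local")
    // same effect as enableHiveSupport(): use the Hive-aware catalog
    .config("spark.sql.catalogImplementation", "hive")
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .getOrCreate();

spark.sql("show databases").show();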

