
I am running Spark inside Glue to write to AWS Elasticsearch, with the following Spark configuration:

  conf.set("es.nodes", s"$nodes/$indexName")
  conf.set("es.port", "443")
  conf.set("es.batch.write.retry.count", "200")
  conf.set("es.batch.size.bytes", "512kb")
  conf.set("es.batch.size.entries", "500")
  conf.set("es.index.auto.create", "false")
  conf.set("es.nodes.wan.only", "true")
  conf.set("es.net.ssl", "true")
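One thing worth checking, independent of the VPC question: in the elasticsearch-hadoop connector, `es.nodes` is expected to contain only host names, while the target index is supplied separately (via `es.resource` or as the argument to `saveToEs`), so appending `/$indexName` to `es.nodes` may itself confuse version detection. A minimal sketch of that setup, with placeholder host and index values:

```scala
import org.apache.spark.SparkConf

// Placeholder values -- substitute your real endpoint and index.
val nodes = "vpc-my-domain.eu-west-1.es.amazonaws.com"
val indexName = "my-index"

val conf = new SparkConf()
conf.set("es.nodes", nodes) // host only; the index is NOT part of es.nodes
conf.set("es.port", "443")
conf.set("es.nodes.wan.only", "true")
conf.set("es.net.ssl", "true")

// The index is then passed at write time, e.g.:
//   import org.elasticsearch.spark.sql._
//   df.saveToEs(s"$indexName/_doc")
```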

However, I get the following error:

diagnostics: User class threw exception: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
    at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
    at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
    ....

I know which VPC my Elasticsearch instance is running in, but I am not sure how to set that for Glue/Spark, or whether the problem lies elsewhere. Any idea?

I have also tried to add a "glue jdbc" connection, which should use the proper VPC connection, but I am not sure how to set it up properly:

  import scala.reflect.runtime.universe._
  def saveToEs[T <: Product : TypeTag](index: String, data: RDD[T]) =
    SparkProvider.glueContext.getJDBCSink(
      catalogConnection = "my-elasticsearch-connection",
      options = JsonOptions(
        "WHAT HERE?"
      ),
      transformationContext = "SinkToElasticSearch"
    ).writeDynamicFrame(DynamicFrame(
      SparkProvider.sqlContext.createDataFrame[T](data),
      SparkProvider.glueContext))

1 Answer

Try creating a dummy JDBC connection. The dummy connection will tell Glue the Elasticsearch VPC, subnet, and security group. A test of the connection might not work, but when you run your job with the connection attached, Glue will use the connection metadata to launch an elastic network interface in your VPC to facilitate this communication. More on connections can be found here:

[1] https://docs.aws.amazon.com/glue/latest/dg/start-connecting.html
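With the dummy connection attached to the job purely for networking, the write itself can still go through the elasticsearch-spark connector rather than a JDBC sink. A sketch, assuming the elasticsearch-spark JAR is on the job's dependent JARs path and `df` is the DataFrame you want to index (endpoint and index name are placeholders):

```scala
import org.elasticsearch.spark.sql._ // adds saveToEs to DataFrame

// `df` is assumed to be an already-built DataFrame.
df.saveToEs(
  "my-index/_doc", // hypothetical index/type
  Map(
    "es.nodes"          -> "vpc-my-domain.eu-west-1.es.amazonaws.com", // placeholder
    "es.port"           -> "443",
    "es.nodes.wan.only" -> "true",
    "es.net.ssl"        -> "true"
  )
)
```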


7 Comments

thank you, I already tried that (please look at my updated question) but no luck so far.
The test you run here is just a network test. If you attach a connection to your job, make sure your security group has the necessary rules to facilitate the communication. A simple os.system("nc -vz <ip-address> <port>") should prove connectivity before you start modifying your code.
yes, security groups and VPC/subnet are set properly, as I already used them for RDS connections. My doubt is about the whole process. I guess I have to download the Elasticsearch driver (org.elasticsearch.xpack.sql.jdbc.EsDriver) and "force" it to use my connection, using something similar to the snippet I put up.
In my case, I downloaded the JAR and added it to the dependent JARs path (uploaded to S3); for the rest, I referred to the examples here: docs.databricks.com/data/data-sources/elasticsearch.html
In your case, were you using EMR/Spark + AWS Elasticsearch, or Glue/Spark + AWS Elasticsearch?
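The connectivity check suggested in the comments can also be done from Scala without shelling out to nc. A small sketch using java.net.Socket (the endpoint below is a placeholder):

```scala
import java.net.{InetSocketAddress, Socket}

// Returns true if a TCP connection to host:port succeeds within timeoutMs.
def portOpen(host: String, port: Int, timeoutMs: Int): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: Exception => false
  } finally {
    socket.close()
  }
}

// Example (placeholder endpoint):
// portOpen("vpc-my-domain.eu-west-1.es.amazonaws.com", 443, 3000)
```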
