
I am running Spark inside Glue to write to AWS Elasticsearch, with the following Spark configuration:

  conf.set("es.nodes", s"$nodes/$indexName")
  conf.set("es.port", "443")
  conf.set("es.batch.write.retry.count", "200")
  conf.set("es.batch.size.bytes", "512kb")
  conf.set("es.batch.size.entries", "500")
  conf.set("es.index.auto.create", "false")
  conf.set("es.nodes.wan.only", "true")
  conf.set("es.net.ssl", "true")
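One thing worth checking, independent of the VPC question: in the elasticsearch-hadoop connector, `es.nodes` is expected to contain only host names, while the target index is supplied separately (via `es.resource` or as the argument to `saveToEs`), so appending `/$indexName` to `es.nodes` may itself confuse version detection. A minimal sketch of that setup, with placeholder host and index values:

```scala
import org.apache.spark.SparkConf

// Placeholder values -- substitute your real endpoint and index.
val nodes = "vpc-my-domain.eu-west-1.es.amazonaws.com"
val indexName = "my-index"

val conf = new SparkConf()
conf.set("es.nodes", nodes) // host only; the index is NOT part of es.nodes
conf.set("es.port", "443")
conf.set("es.nodes.wan.only", "true")
conf.set("es.net.ssl", "true")

// The index is then passed at write time, e.g.:
//   import org.elasticsearch.spark.sql._
//   df.saveToEs(s"$indexName/_doc")
```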

However, I get the following error:

diagnostics: User class threw exception: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
    at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
    at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
    ....

I know which VPC my Elasticsearch instance is running in, but I am not sure how to set that for Glue/Spark, or whether the problem lies elsewhere. Any idea?

I have also tried to add a "glue jdbc" connection, which should use the proper VPC connection, but I am not sure how to set it up properly:

  import scala.reflect.runtime.universe._
  def saveToEs[T <: Product : TypeTag](index: String, data: RDD[T]) =
    SparkProvider.glueContext.getJDBCSink(
      catalogConnection = "my-elasticsearch-connection",
      options = JsonOptions(
        "WHAT HERE?"
      ),
      transformationContext = "SinkToElasticSearch"
    ).writeDynamicFrame(DynamicFrame(
      SparkProvider.sqlContext.createDataFrame[T](data),
      SparkProvider.glueContext))

1 Answer

Try creating a dummy JDBC connection. The dummy connection will tell Glue the Elasticsearch VPC, subnet, and security group. A test of the connection might not work, but when you run your job with the connection attached, Glue will use the connection metadata to launch an elastic network interface in your VPC to facilitate this communication. More on connections can be found here:

[1] https://docs.aws.amazon.com/glue/latest/dg/start-connecting.html
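With the dummy connection attached to the job purely for networking, the write itself can still go through the elasticsearch-spark connector rather than a JDBC sink. A sketch, assuming the elasticsearch-spark JAR is on the job's dependent JARs path and `df` is the DataFrame you want to index (endpoint and index name are placeholders):

```scala
import org.elasticsearch.spark.sql._ // adds saveToEs to DataFrame

// `df` is assumed to be an already-built DataFrame.
df.saveToEs(
  "my-index/_doc", // hypothetical index/type
  Map(
    "es.nodes"          -> "vpc-my-domain.eu-west-1.es.amazonaws.com", // placeholder
    "es.port"           -> "443",
    "es.nodes.wan.only" -> "true",
    "es.net.ssl"        -> "true"
  )
)
```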


7 Comments

thank you, I already tried that (please look at my updated question) but no luck so far.
The test you run here is just a network test. If you attach a connection to your job, make sure your security group has the necessary rules to facilitate the communication. A simple os.system("nc -vz <ip-address> <port>") should prove connectivity before you start modifying your code.
yes, security groups and VPC/subnet are set properly, as I already used them for RDS connections. My doubt is about the whole process. I guess I have to download the Elasticsearch driver (org.elasticsearch.xpack.sql.jdbc.EsDriver) and "force" it to use my connection, using something similar to the snippet I put up.
In my case, I downloaded the JAR and added it to the dependent JARs path (uploaded to S3); for the rest, I referred to the examples here: docs.databricks.com/data/data-sources/elasticsearch.html
In your case, were you using EMR/Spark + AWS Elasticsearch, or Glue/Spark + AWS Elasticsearch?
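The connectivity check suggested in the comments can also be done from Scala without shelling out to nc. A small sketch using java.net.Socket (the endpoint below is a placeholder):

```scala
import java.net.{InetSocketAddress, Socket}

// Returns true if a TCP connection to host:port succeeds within timeoutMs.
def portOpen(host: String, port: Int, timeoutMs: Int): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: Exception => false
  } finally {
    socket.close()
  }
}

// Example (placeholder endpoint):
// portOpen("vpc-my-domain.eu-west-1.es.amazonaws.com", 443, 3000)
```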
