I am running Spark inside AWS Glue and writing to an AWS Elasticsearch Service instance, with the following Spark configuration:
conf.set("es.nodes", s"$nodes/$indexName")
conf.set("es.port", "443")
conf.set("es.batch.write.retry.count", "200")
conf.set("es.batch.size.bytes", "512kb")
conf.set("es.batch.size.entries", "500")
conf.set("es.index.auto.create", "false")
conf.set("es.nodes.wan.only", "true")
conf.set("es.net.ssl", "true")
However, what I get is the following error:
diagnostics: User class threw exception: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
....
I know which VPC my Elasticsearch instance is running in, but I am not sure how to configure that for Glue/Spark, or whether the problem lies elsewhere. Any ideas?
I have also tried adding a Glue JDBC connection, which should use the proper VPC, but I am not sure how to set it up properly:
import com.amazonaws.services.glue.DynamicFrame
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.rdd.RDD

import scala.reflect.runtime.universe._

def saveToEs[T <: Product : TypeTag](index: String, data: RDD[T]) =
  SparkProvider.glueContext.getJDBCSink(
    catalogConnection = "my-elasticsearch-connection",
    options = JsonOptions(
      "WHAT HERE?"
    ),
    transformationContext = "SinkToElasticSearch"
  ).writeDynamicFrame(DynamicFrame(
    SparkProvider.sqlContext.createDataFrame[T](data),
    SparkProvider.glueContext))
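For reference, I would expect to call this helper roughly like this (the case class and data are made up for illustration):

case class Person(name: String, age: Int)

val people: RDD[Person] =
  SparkProvider.sqlContext.sparkContext.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
saveToEs("my-index", people)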