I am encountering an error while trying to connect Spark to Elasticsearch and insert a DataFrame. The specific error message is as follows:
Py4JJavaError: An error occurred while calling o130.save. : org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
Here is the relevant code snippet:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
appName = "mysql example"
master = "local"
spark = SparkSession.builder.master(master).appName(appName) \
    .config("spark.jars", "postgresql-42.2.6.jar,elasticsearch-spark-30_2.12-8.8.1.jar") \
    .getOrCreate()
category_product_df = dataframe_from_table("category")
category_product_df.write.format("org.elasticsearch.spark.sql") \
.option("es.resource", "wael/test") \
.option("es.port", "9200") \
.option("es.nodes", "elastic:changeme@localhost") \
.option("es.nodes.wan.only", "true") \
.save()
Running this code to write the DataFrame to Elasticsearch produces the error shown above.
Here are some details about my environment:
PySpark version: 3.4.0
Scala version: 2.12.17
Elasticsearch version: 7.15.0
Elasticsearch-Spark connector version: elasticsearch-spark-30_2.12-8.8.1.jar
I have verified that the Elasticsearch cluster is reachable and that the necessary network connectivity is in place. I have also checked the cluster configuration and confirmed that the host, port, and authentication credentials are correct.
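To double-check the "Cannot detect ES version" message, I confirmed that the root endpoint responds, since that is the endpoint the connector queries to detect the cluster version. A minimal check, assuming the elastic:changeme credentials and localhost:9200 from the snippet above:

```python
import base64
import json
import urllib.error
import urllib.request

def check_es_version(url="http://localhost:9200",
                     user="elastic", password="changeme", timeout=5):
    """Hit the Elasticsearch root endpoint (the same call the connector
    makes to detect the cluster version). Returns the version string,
    or None if the cluster is not reachable."""
    req = urllib.request.Request(url)
    # Basic auth header, matching the credentials embedded in es.nodes
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.load(resp)["version"]["number"]
    except (urllib.error.URLError, OSError, KeyError, ValueError):
        return None
```

From the Spark driver's machine this returns "7.15.0" for my cluster, so plain HTTP access works.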
The error message itself suggests setting the 'es.nodes.wan.only' property to 'true' when targeting a WAN/Cloud instance, so I have already included that option in the snippet above.
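One variation I am considering: the connector has dedicated 'es.net.http.auth.user' / 'es.net.http.auth.pass' options for credentials, instead of embedding user:pass@ in 'es.nodes'. A sketch of how the same write options could be assembled that way (es_write_options is just a hypothetical helper, not part of my current code):

```python
def es_write_options(host="localhost", port="9200",
                     user="elastic", password="changeme"):
    """Build the es-hadoop option map with credentials in the dedicated
    auth options rather than embedded in es.nodes."""
    return {
        "es.nodes": host,                 # host only, no user:pass@ prefix
        "es.port": port,
        "es.net.http.auth.user": user,
        "es.net.http.auth.pass": password,
        "es.nodes.wan.only": "true",      # talk only to the listed nodes
    }

# Usage with the DataFrame from the snippet above:
# category_product_df.write.format("org.elasticsearch.spark.sql") \
#     .options(**es_write_options()) \
#     .option("es.resource", "wael/test") \
#     .save()
```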
Despite these efforts, the error persists. I would appreciate any insights or suggestions on how to resolve this issue and successfully write the DataFrame from Spark to Elasticsearch.