I'm new to Apache Spark and I'm trying to load some elasticsearch data from a scala script I'm running on it.
Here is my script:
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder.appName("Simple Application").getOrCreate()
val options = Map("es.nodes" -> "x.x.x.x:9200", "pushdown" -> "true")
import sparkSession.implicits._
val df = sparkSession.read
  .format("org.elasticsearch.spark.sql")
  .options(options)
  .load("my_index-07.05.2018/_doc")
  .limit(5)
  .select("SomeField", "AnotherField", "AnotherOne")
df.cache()
df.show()
And it works, but it's terribly slow. Am I doing anything wrong here?
Connectivity shouldn't be an issue at all: the index I'm querying has around 200k documents, and I'm limiting the query to 5 results.
By the way, I had to run spark-shell (or spark-submit) passing the elasticsearch-hadoop dependency as a command-line parameter (--packages org.elasticsearch:elasticsearch-hadoop:6.3.0). Is that the right way to do it? Is there any way to build the sbt package with all the dependencies included?
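For context, here is what I've tried so far for bundling: a fat jar via the sbt-assembly plugin. This is just a sketch of the build files I'm experimenting with; the version numbers (Spark 2.3.0, Scala 2.11, sbt-assembly 0.14.6) are assumptions to match the elasticsearch-hadoop 6.3.0 connector and may need adjusting.

```scala
// project/plugins.sbt — enables the sbt-assembly plugin (version assumed)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")

// build.sbt — Spark itself is marked "provided" so the cluster's Spark
// distribution supplies it and only the ES connector ends up in the fat jar.
name := "simple-application"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided",
  "org.elasticsearch" % "elasticsearch-hadoop" % "6.3.0"
)
```

Running `sbt assembly` should then produce a single jar that can be passed to spark-submit without the --packages flag, if I understand the plugin correctly.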
Thanks a lot