I'm new to Apache Spark and I'm trying to load some elasticsearch data from a scala script I'm running on it.

Here is my script:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.appName("Simple Application").getOrCreate()
val options = Map("es.nodes" -> "x.x.x.x:9200", "pushdown" -> "true")

import sparkSession.implicits._
val df = sparkSession.read
  .format("org.elasticsearch.spark.sql")
  .options(options)
  .load("my_index-07.05.2018/_doc")
  .limit(5)
  .select("SomeField", "AnotherField", "AnotherOne")

df.cache()
df.show()

And it works, but it's terribly slow. Am I doing anything wrong here?

Connectivity shouldn't be an issue at all; the index I'm trying to query has around 200k documents, and I'm limiting the query to 5 results.
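For reference, a variant I could try, using the documented es.query option to push a filter down to Elasticsearch so only matching documents are read (the field name and query value below are placeholders, not from my real index):

```scala
// Assumption: es.query is the elasticsearch-hadoop option that pushes a
// query down to Elasticsearch, limiting what the connector has to fetch.
val queryOptions = Map(
  "es.nodes"  -> "x.x.x.x:9200",
  "pushdown"  -> "true",
  // hypothetical filter; replace with a real query for the index
  "es.query"  -> """{"query": {"term": {"SomeField": "some-value"}}}"""
)

val filtered = sparkSession.read
  .format("org.elasticsearch.spark.sql")
  .options(queryOptions)
  .load("my_index-07.05.2018/_doc")

filtered.show(5) // show(n) caps the displayed rows without a separate limit(n)
```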

By the way, I had to run spark-shell (or spark-submit) by passing the elasticsearch-hadoop dependency as a command-line parameter (--packages org.elasticsearch:elasticsearch-hadoop:6.3.0). Is that the right way to do it? Is there any way to just build the sbt package with all the dependencies included?

Thanks a lot

  • Did you find an answer? I'm having the same problem with Azure Databricks, even with a large cluster. Commented May 13, 2019 at 10:01
  • Any solution, guys? Commented Jun 28, 2019 at 13:01

1 Answer

Are you running this locally on a single machine? If so, it could be normal... You will have to check your network, your Spark web UI, etc.

As for submitting all the dependencies without specifying them on the spark-submit command line: what we usually do is create a fat JAR using sbt-assembly.

http://queirozf.com/entries/creating-scala-fat-jars-for-spark-on-sbt-with-sbt-assembly-plugin
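For example, a minimal sketch of the sbt-assembly setup (plugin and library versions below are assumptions; match them to your Spark and Elasticsearch installs):

```scala
// project/plugins.sbt — enable the sbt-assembly plugin
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt — mark Spark itself as "provided" so the fat JAR only
// bundles the Elasticsearch connector, not Spark's own classes
name := "simple-application"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"            % "2.3.0" % "provided",
  "org.elasticsearch" %  "elasticsearch-hadoop" % "6.3.0"
)
```

Running `sbt assembly` then produces a single JAR you can hand to spark-submit without the --packages flag.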

Comments

I'm indeed running it on my own laptop, but I'm just querying plain data from an index, not doing any maths at all. Is this still normal?
It's precisely when running df.show() that it takes ages, stuck showing "[Stage 2:> (0 + 2) / 2]".
What if you try to simplify it first? val df = sparkSession.read.format("org.elasticsearch.spark.sql").options(options).load("my_index-07.05.2018/_doc").show
Don't do the limit and the select. Spark is lazy, so until you call an action it does nothing. What I want you to try is creating the data frame without specifying a limit, letting Spark and the driver do their job; you can limit the number of rows with show(10) as well.
I did it and it’s very slow when running show, even show(5).