I'm new to Apache Spark and I'm trying to load some elasticsearch data from a scala script I'm running on it.

Here is my script:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.appName("Simple Application").getOrCreate()
val options = Map("es.nodes" -> "x.x.x.x:9200", "pushdown" -> "true")

import sparkSession.implicits._
val df = sparkSession.read
  .format("org.elasticsearch.spark.sql")
  .options(options)
  .load("my_index-07.05.2018/_doc")
  .limit(5)
  .select("SomeField", "AnotherField", "AnotherOne")

df.cache()
df.show()

And it works, but it's terribly slow. Am I doing anything wrong here?

Connectivity shouldn't be an issue at all; the index I'm trying to query has around 200k documents, and I'm limiting the query to 5 results.
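For reference, a variant I could try, using the documented es.query option to push a filter down to Elasticsearch so only matching documents are read (the field name and query value below are placeholders, not from my real index):

```scala
// Assumption: es.query is the elasticsearch-hadoop option that pushes a
// query down to Elasticsearch, limiting what the connector has to fetch.
val queryOptions = Map(
  "es.nodes"  -> "x.x.x.x:9200",
  "pushdown"  -> "true",
  // hypothetical filter; replace with a real query for the index
  "es.query"  -> """{"query": {"term": {"SomeField": "some-value"}}}"""
)

val filtered = sparkSession.read
  .format("org.elasticsearch.spark.sql")
  .options(queryOptions)
  .load("my_index-07.05.2018/_doc")

filtered.show(5) // show(n) caps the displayed rows without a separate limit(n)
```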

By the way, I had to run spark-shell (or spark-submit) by passing the elasticsearch-hadoop dependency as a command-line parameter (--packages org.elasticsearch:elasticsearch-hadoop:6.3.0). Is that the right way to do it? Is there any way to just build the sbt package with all the dependencies included?

Thanks a lot

  • Did you find an answer? I'm having the same problem with Azure Databricks, even with a large cluster. Commented May 13, 2019 at 10:01
  • Any solution, guys? Commented Jun 28, 2019 at 13:01

1 Answer

Are you running this locally on a single machine? If so, it could be normal... You will have to check your network, your Spark web UI, etc.

As for submitting all the dependencies without specifying them on the spark-submit command line: what we usually do is create a fat JAR using sbt-assembly.

http://queirozf.com/entries/creating-scala-fat-jars-for-spark-on-sbt-with-sbt-assembly-plugin
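For example, a minimal sketch of the sbt-assembly setup (plugin and library versions below are assumptions; match them to your Spark and Elasticsearch installs):

```scala
// project/plugins.sbt — enable the sbt-assembly plugin
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt — mark Spark itself as "provided" so the fat JAR only
// bundles the Elasticsearch connector, not Spark's own classes
name := "simple-application"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"            % "2.3.0" % "provided",
  "org.elasticsearch" %  "elasticsearch-hadoop" % "6.3.0"
)
```

Running `sbt assembly` then produces a single JAR you can hand to spark-submit without the --packages flag.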

Comments

I'm indeed running it on my own laptop, but I'm just querying plain data from an index, not doing any maths at all. Is this still normal?
It's precisely when running df.show() that it takes ages, stuck showing "[Stage 2:> (0 + 2) / 2]".
What if you try to simplify it first? val df = sparkSession.read.format("org.elasticsearch.spark.sql").options(options).load("my_index-07.05.2018/_doc").show
Don't do the limit and the select. Spark is lazy, so until you call an action it does nothing. What I want you to try is creating the data frame without specifying a limit, letting Spark and the driver do their job; you can limit the number of rows with show(10) as well.
I did it and it’s very slow when running show, even show(5).