
I am trying to use Apache Spark to build an index in Elasticsearch (writing a large amount of data to ES). I have written a Scala program that creates the index with Apache Spark. The data to be indexed arrives as product beans in a LinkedList, so I traverse the bean list and index each item. My code is given below.

import java.util
import scala.collection.JavaConverters._

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf().setAppName("ESIndex").setMaster("local[*]")
conf.set("es.index.auto.create", "true").set("es.nodes", "127.0.0.1")
  .set("es.port", "9200")
  .set("es.http.timeout", "5m")
  .set("es.scroll.size", "100")

val sc = new SparkContext(conf)

// Returns my product beans in a LinkedList.
val list: util.LinkedList[product] = getData()

// Build and save a single-element RDD per bean.
for (item <- list.asScala) {
  sc.makeRDD(Seq(item)).saveToEs("my_core/json")
}

The issue with this approach is that it takes too much time to create the index. Is there a better way to do it?

2 Comments

  • Why do you pass the data through the driver? This is an obvious bottleneck. Commented Mar 9, 2016 at 9:56
  • Plus, if your data fits in memory, why burden your architecture with Spark? Commented Mar 9, 2016 at 10:00

1 Answer

  1. Don't pass data through the driver unless it is necessary. Depending on the source of the data returned by getData, use the relevant input method or create your own. If the data comes from MongoDB, for example, use mongo-hadoop, Spark-MongoDB, or Drill with a JDBC connection. Then use map or a similar method to build the required objects and call saveToEs on the transformed RDD (a sketch follows this list).

  2. Creating an RDD with a single element doesn't make sense. It doesn't benefit from the Spark architecture at all: you just start a potentially huge number of tiny jobs, each with nothing to distribute and only a single active executor.
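As a hedged sketch of point 1 — the real source behind getData isn't shown in the question — the idea is to read the data with a distributed input method, transform it on the executors, and index it with a single call, never materializing it on the driver. The file path below is a hypothetical stand-in for whatever backend getData actually queries; saveJsonToEs is the elasticsearch-hadoop variant that accepts an RDD of JSON strings.

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf().setAppName("ESIndex").setMaster("local[*]")
  .set("es.index.auto.create", "true")
  .set("es.nodes", "127.0.0.1")
  .set("es.port", "9200")

val sc = new SparkContext(conf)

// Hypothetical stand-in for the real backend: one JSON document per line.
// With MongoDB, JDBC, etc. you would use the matching Spark connector
// instead, so the data is partitioned from the first step.
val raw = sc.textFile("hdfs:///data/products.jsonl")

// Transform on the executors and index directly; nothing is ever
// collected on the driver. saveJsonToEs takes an RDD[String] where
// each element is already a JSON document.
raw.filter(_.trim.nonEmpty).saveJsonToEs("my_core/json")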


2 Comments

So is there a way to process the entire list, rather than a single object at a time, through the driver?
Once again: don't use the driver to pass or process the data. If you do, and process the data locally, there is no reason to use Spark.
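For completeness, if the data genuinely has to start on the driver as a LinkedList, a minimal improvement over the per-bean loop is to parallelize the whole collection once and issue a single saveToEs call, so the write is split into partitions and bulk-indexed in parallel. This still pushes everything through the driver, which the answerer advises against; it is a sketch of the smaller fix, not the recommended design. The product bean and getData come from the question.

import java.util
import scala.collection.JavaConverters._

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf().setAppName("ESIndex").setMaster("local[*]")
  .set("es.index.auto.create", "true")
  .set("es.nodes", "127.0.0.1")
  .set("es.port", "9200")

val sc = new SparkContext(conf)

// Same driver-side source as in the question.
val list: util.LinkedList[product] = getData()

// One RDD over the whole collection and one saveToEs call:
// the list is split into partitions and bulk-indexed in parallel,
// instead of one tiny job per bean.
sc.parallelize(list.asScala.toSeq).saveToEs("my_core/json")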
