I am trying to use Apache Spark to query my data in Elasticsearch, but my Spark job is taking about 20 hours to do an aggregation and is still running. The same query in Elasticsearch takes about 6 seconds.
I understand that the data has to move from the Elasticsearch cluster to my Spark cluster, and that some data shuffling happens in Spark.
The data in my ES index is approx. 300 million documents, and each document has about 400 fields (1.4 terabytes in total).
I've got a 3-node Spark cluster (1 master, 2 workers) with 60 GB of memory and 8 cores in total.
The time it takes to run is not acceptable. Is there a way to make my Spark job run faster?
Here is my spark configuration:
SparkConf sparkConf = new SparkConf(true)
        .setAppName("SparkQueryApp")
        .setMaster("spark://10.0.0.203:7077")
        .set("es.nodes", "10.0.0.207")
        .set("es.cluster", "wp-es-reporting-prod")
        .setJars(JavaSparkContext.jarOfClass(Demo.class))
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.default.parallelism", String.valueOf(cpus * 2))
        .set("spark.executor.memory", "8g");
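For clarity, `cpus` isn't defined in the snippet above; it is the total core count across the cluster (8 in my case), so the parallelism value works out like this:

```java
public class ParallelismCheck {
    public static void main(String[] args) {
        int cpus = 8;                    // total cores across my two workers
        int parallelism = cpus * 2;      // value fed to spark.default.parallelism
        System.out.println(parallelism); // 16 default partitions
    }
}
```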
Edit:
SparkContext sparkCtx = new SparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(sparkCtx);
DataFrame df = JavaEsSparkSQL.esDF(sqlContext, "customer-rpts01-201510/sample");
DataFrame dfCleaned = cleanSchema(sqlContext, df);
dfCleaned.registerTempTable("RPT");
DataFrame sqlDFTest = sqlContext.sql("SELECT agent, count(request_type) FROM RPT GROUP BY agent");
for (Row row : sqlDFTest.collect()) {
    System.out.println(">> " + row);
}
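One thing I've been wondering about, since each document has ~400 fields but the query only touches two of them: restricting the fields read from ES via elasticsearch-hadoop's `es.read.field.include` option. This is an untested sketch, and I'm assuming that option applies to the `esDF` read path:

```java
import org.apache.spark.SparkConf;

// Sketch (not something I've run yet): only fetch the two fields the
// aggregation needs, instead of all ~400 per document, to cut the
// ES -> Spark transfer volume.
SparkConf slimConf = new SparkConf(true)
        .setAppName("SparkQueryApp")
        .setMaster("spark://10.0.0.203:7077")
        .set("es.nodes", "10.0.0.207")
        .set("es.read.field.include", "agent,request_type")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
```

Would something like this be expected to help, or is the bottleneck elsewhere?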