I am new to Elasticsearch and have a large amount of data to index.
I am using Apache Spark to index data from a Hive table into Elasticsearch.
As part of this, I wrote a simple Spark script:
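import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // provides saveJsonToEs on RDD[String]

object PushToES {
  def main(args: Array[String]): Unit = {
    // Arguments: Hive query, target index, Elasticsearch host
    val Array(inputQuery, index, host) = args
    val sparkConf = new SparkConf().setMaster("local[1]").setAppName("PushToES")
    sparkConf.set("....", host)
    sparkConf.set("....", "9200")
    val sc = new SparkContext(sparkConf)
    val hiveSqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val ps = hiveSqlContext.sql(inputQuery)
    // Serialize each row to JSON and write it to the index
    ps.toJSON.saveJsonToEs(index)
  }
}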
After that, I build a jar and submit the job with spark-submit:
spark-submit --jars ~/*.jar --master local[*] --class com.PushToES *.jar "select * from gtest where day=20170711" gest3 localhost
Then I run the following command to check the document count:
curl -XGET 'localhost:9200/test/test_test/_count?pretty'
The first time, it shows the correct result:
{
"count" : 10,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
If I run the same curl command a second time, it gives the result below:
{
"count" : 20,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
If I run the same command a third time, I get:
{
"count" : 30,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
I do not understand why the count is added to the existing count in the index every time.
Please let me know how I can resolve this, i.e. no matter how many times I run it, I should get the same value (the correct count, which is 10).
I expect the result below, because the correct count is 10 (a count(*) query on the Hive table returns 10 every time):
{
"count" : 10,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
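For reference, here is a minimal, untested sketch of one possible way to make repeated writes keep the count stable. It assumes the count grows because the same rows are written again with auto-generated document ids, and that the Hive rows contain a column with a unique value per row. The `idField` argument, the `PushToESWithIds` name, and the `writeConfig` helper are hypothetical; `es.nodes`, `es.port`, and `es.mapping.id` are elasticsearch-hadoop settings, and their use here is my assumption, not necessarily what the script above sets. The sketch also assumes a Spark 1.x API, where `toJSON` returns an `RDD[String]`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.elasticsearch.spark._ // adds saveJsonToEs to RDD[String]

object PushToESWithIds {
  // Hypothetical helper: tell the connector to take the document _id
  // from the given field of each JSON document instead of generating one.
  def writeConfig(idField: String): Map[String, String] =
    Map("es.mapping.id" -> idField)

  def main(args: Array[String]): Unit = {
    // Arguments: Hive query, target index, Elasticsearch host,
    // and the name of a column that is unique per row (assumption)
    val Array(inputQuery, index, host, idField) = args

    val sparkConf = new SparkConf()
      .setAppName("PushToESWithIds")
      .set("es.nodes", host)   // assumed setting for the Elasticsearch host
      .set("es.port", "9200")  // assumed setting for the REST port

    val sc = new SparkContext(sparkConf)
    val hiveSqlContext = new HiveContext(sc)
    val ps = hiveSqlContext.sql(inputQuery)

    // With a fixed _id, writing the same row again replaces the existing
    // document instead of adding a new one.
    ps.toJSON.saveJsonToEs(index, writeConfig(idField))
  }
}
```

With a stable id per row, running the job again should replace the 10 existing documents rather than add 10 more, so `_count` should stay at 10. Documents already indexed with auto-generated ids would remain in the index until it is deleted or recreated.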
Thanks in advance.