I've got an index of about 3 GB on a 20-node Elasticsearch cluster running in AWS EMR. The index has 5 shards and is replicated 4 times. The underlying data are books, but I've split them into paragraph- or line-level chunks, depending on formatting, so there are about 27 million documents. Indexing only takes about 5 minutes.
I have about 15 million phrases I want to search for in the index.
The search logic is a four-layer waterfall that stops as soon as results are found: exact match => fuzzy match with an edit distance of 1 => fuzzy match with an edit distance of 2 => partial phrase match. I broke it down this way so I could filter the matches by a quality measure at each layer.
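For concreteness, the waterfall looks roughly like this (a sketch, not my exact code: the index name "books" and field name "text" are placeholders, and note that Elasticsearch's fuzziness applies the edit distance per term, not to the phrase as a whole; the "75%" partial-match threshold is also just an example value):

```python
def build_waterfall_queries(phrase, field="text"):
    """Return the four query bodies, in the order they are tried."""
    return [
        # 1. exact phrase match
        {"query": {"match_phrase": {field: phrase}}},
        # 2. fuzzy match, edit distance 1 (applied per term)
        {"query": {"match": {field: {"query": phrase, "fuzziness": 1}}}},
        # 3. fuzzy match, edit distance 2
        {"query": {"match": {field: {"query": phrase, "fuzziness": 2}}}},
        # 4. partial phrase: only require a fraction of the terms to match
        {"query": {"match": {field: {"query": phrase,
                                     "minimum_should_match": "75%"}}}},
    ]

def search_waterfall(client, phrase, index="books"):
    """Try each layer in order; stop at the first layer that returns hits."""
    for layer, body in enumerate(build_waterfall_queries(phrase)):
        hits = client.search(index=index, body=body)["hits"]["hits"]
        if hits:
            return layer, hits  # layer number doubles as a quality signal
    return None, []
```

The layer index that produced the hit is what I use as the quality measure for filtering.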
In order to distribute and execute the searches, I'm using Spark.
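Roughly, each Spark partition gets its own client and batches its phrases into `_msearch` requests so one HTTP round trip carries many queries (a sketch under assumed names; the batch size, index name, and field name are made up, and the `mapPartitions` part is shown only in comments since it needs a live cluster):

```python
import json

def build_msearch_body(phrases, index="books", field="text"):
    """Build an _msearch request body: alternating header/query NDJSON lines."""
    lines = []
    for phrase in phrases:
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps({"query": {"match_phrase": {field: phrase}}}))
    return "\n".join(lines) + "\n"

# Inside the Spark job, roughly:
# def search_partition(phrases):
#     client = Elasticsearch(hosts)            # one client per partition
#     batch = []
#     for phrase in phrases:
#         batch.append(phrase)
#         if len(batch) == 100:                # assumed batch size
#             yield client.msearch(body=build_msearch_body(batch))
#             batch = []
#     if batch:
#         yield client.msearch(body=build_msearch_body(batch))
#
# results = phrases_rdd.mapPartitions(search_partition)
```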
I'm finding that the fastest I can get the search to run is about 420 phrases per second, which means the entire job will take 10-12 hours.
My question is: is that a reasonable search rate?
Would I get better performance if I put the entire index on one shard and replicated the full index to every node? Or should I go the other direction and increase the shard count? I suspect the answer to both questions will be "Try both!", which I'll likely do in the long run, but I have a short-term deadline I'm trying to optimize for, so I wanted to see if anyone else has experience with a similar problem.
I'm happy to provide more details as needed.
Apologies if this isn't on topic - I'm not finding a lot of documentation on this sort of use case for Elasticsearch.