
I've got an index of about 3 GB on a 20-node Elasticsearch cluster running on AWS EMR. The index has 5 shards and is replicated 4 times. The underlying data are books, but I've split them into paragraph or line chunks, depending on formatting, so there are about 27 million documents. Indexing takes only ~5 minutes.

I have about 15 million phrases I want to search for in the index.

The search logic is a four-layered waterfall that stops once results are found: exact match => fuzzy match with an edit distance of 1 => fuzzy match with an edit distance of 2 => partial phrase match. I broke it down this way so I could filter the matches by some quality measure.
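The waterfall above can be sketched as query bodies plus a tiered runner. This is a minimal sketch assuming the Elasticsearch query DSL and a `text` field; the field name and the slop value are illustrative, not from the original post. Note that `match_phrase` doesn't support fuzziness, so the fuzzy tiers use a `match` query, where fuzziness applies per term rather than to the whole phrase.

```python
def tier_queries(phrase):
    """Return the query bodies for each tier, cheapest first."""
    return [
        # Tier 1: exact phrase match
        {"query": {"match_phrase": {"text": phrase}}},
        # Tier 2: fuzzy match, edit distance 1 (applied per term)
        {"query": {"match": {"text": {"query": phrase, "fuzziness": 1}}}},
        # Tier 3: fuzzy match, edit distance 2
        {"query": {"match": {"text": {"query": phrase, "fuzziness": 2}}}},
        # Tier 4: partial phrase match (slop value is illustrative)
        {"query": {"match_phrase": {"text": {"query": phrase, "slop": 3}}}},
    ]

def waterfall_search(search_fn, phrase):
    """Run tiers in order; stop at the first tier that returns hits.

    `search_fn(body)` is an injected callable returning a list of hits,
    e.g. a thin wrapper around es.search(...)["hits"]["hits"].
    """
    for body in tier_queries(phrase):
        hits = search_fn(body)
        if hits:
            return hits
    return []
```

Injecting `search_fn` keeps the tier logic testable without a live cluster.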

In order to distribute and execute the searches, I'm using Spark.
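One common shape for this is to partition the phrase list and open one Elasticsearch client per partition. The pyspark call is shown only in comments and is illustrative; the partitioning helper below is plain Python.

```python
def partition_phrases(phrases, num_partitions):
    """Split phrases into roughly equal chunks, one per Spark partition."""
    chunks = [[] for _ in range(num_partitions)]
    for i, phrase in enumerate(phrases):
        chunks[i % num_partitions].append(phrase)
    return chunks

# With pyspark the driver-side code would look roughly like this
# (illustrative sketch, names are assumptions):
#
#   def search_partition(phrases):
#       es = Elasticsearch(...)   # one client per partition, not per phrase
#       for phrase in phrases:
#           yield phrase, waterfall_search_for(es, phrase)
#
#   results = sc.parallelize(all_phrases, numSlices=400) \
#               .mapPartitions(search_partition)
```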

I'm finding that the fastest I can get the search to run is about 420 phrases per second, which means the entire task will take 10-12 hours.
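As a quick sanity check on that estimate:

```python
phrases = 15_000_000
rate = 420  # phrases per second, as measured above

hours = phrases / rate / 3600
# roughly 9.9 hours of pure search time, consistent with the
# 10-12 hour estimate once overhead is included
```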

My question is: is that a reasonable search rate?

Would I get better performance if I had the entire index on one shard and replicated the full index on each node? Or should I go the other direction and increase the sharding level? I suspect the answer to both questions will be "Try it!", which I'll likely do in the long run, but I have a short-term deadline that I'm trying to optimize for, so I wanted to see if anyone else had experience with a similar problem.

I'm happy to provide more details as needed.

Apologies if this isn't on topic - I'm not finding a lot of documentation on this sort of use case for Elasticsearch.

1 Answer


Having only 3 GB of data on 20 nodes is a waste of resources. If you have a 5-shard index, start with 5 nodes only. Heck, 3 GB is so small that you could even have that index contain only a single shard and run on a single node.

You're lucky that it takes only 5 minutes to index all your data, as that lets you quickly find the right cluster size to run your queries optimally. Start with one primary shard (no replicas) on one node, then add one replica and another node, and so on.

Then start over with two primary shards and two nodes, add replicas and nodes, etc.

For each test, measure how fast it goes, and at some point (likely within a day or two) you'll find the cluster size that is optimal for your search requirements.
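The sweep above can be scripted. This is a minimal sketch assuming the elasticsearch-py client; the index name and the exact reindex/benchmark steps are placeholders, not from the original post.

```python
def index_settings(num_shards, num_replicas):
    """Settings body for a test index with the given topology."""
    return {
        "settings": {
            "number_of_shards": num_shards,
            "number_of_replicas": num_replicas,
        }
    }

# Sweep outline (timing and reindex code omitted, names illustrative):
#
#   for shards in (1, 2, 5):
#       for replicas in range(num_nodes):
#           es.indices.delete(index="books_test", ignore_unavailable=True)
#           es.indices.create(index="books_test",
#                             body=index_settings(shards, replicas))
#           # ...reindex the 3 GB corpus (~5 min), run the query
#           # benchmark, record phrases/second...
```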

UPDATE

If you have 32 CPUs per node, you can have a single node with all twenty shards on it. During each search, each CPU will happily tackle one shard, there will be less network chatter when aggregating the results, and it "should" be faster. I would definitely try it out.


9 Comments

Agreed - the 20 nodes are necessary to increase the parallel capacity of the search - without them, all else being equal, that 340 searches a second would drop to about 85 searches a second.
So you basically have 20 shards and 20 nodes, i.e. one shard per node, right? How many CPUs do you have on each node?
If you have 32 CPUs per node, you can have a single node and your twenty shards on it. During each search, each CPU will happily tackle one shard, there will be less network chatter to aggregate the results and it "should" be faster. I would definitely try it out.
+1 for the answer and your second comment. I also believe that the upvoted comment should be in your answer.
Thanks @eliasah, makes sense, I've updated my answer accordingly
