
I am a newbie to Elasticsearch and am searching over a record set of 100k documents. Here are the settings and mapping JSON with which I have indexed my data:

settings.json

{
    "index": {
        "analysis": {
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 10
                }
            },
            "analyzer": {
                "ngram_tokenizer_analyzer": {
                    "type": "custom",
                    "tokenizer": "ngram_tokenizer"
                }
            }
        }
    }
}

mappings.json

{
    "product": {
        "properties": {
            "name": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "description": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "vendorModelNumber": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "brand": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "specifications": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "upc": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "storeSkuId": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "modelNumber": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            }
        }
    }
}

I need to query documents across all the fields mentioned above, with a per-field priority. Here is my query to search for matching records.

// Fields in priority order: name gets the highest boost (7), brand the lowest (1)
String[] fields = {"name", "description", "modelNumber", "vendorModelNumber",
        "storeSkuId", "upc", "brand"};
BoolQueryBuilder query = QueryBuilders.boolQuery();
int boost = 7;
for (String field : fields) {
    for (String str : dataSplit) {
        query.should(QueryBuilders.wildcardQuery(field, "*" + str.toLowerCase() + "*").boost(boost));
    }
    boost--;
}
client.prepareSearch(index).setQuery(query).setSize(200).setExplain(true).execute().actionGet();

The query works and finds the right data, but it takes a lot of time because of the wildcard queries. Can someone help me optimise this query, or guide me to the best-suited query type for my search? TIA.

  • Why do you use wildcard queries in the first place? With an ngram tokenizer of min_gram 3, a normal match query should work for any input longer than 2 characters. Or what is the reason for the ngram tokenizer anyway? A side note: with this analyzer (as defined) your queries will be case sensitive. Possibly intended, but quite unusual. Commented Aug 3, 2017 at 8:22
  • Thanks @Slomo, you are right. I shouldn't have used wildcards with ngrams. Can I make it case insensitive? And with ngrams, should I be querying with a term query or a match query, i.e. which is the more optimal way? Sorry if that is not a sensible question :) Commented Aug 3, 2017 at 8:39

1 Answer


First, let me answer the simple question: handling case sensitivity. When you define a custom analyzer, you can add filters, which are applied to each token after the input has been processed by the tokenizer.

{
    "index": {
        "analysis": {
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 10
                }
            },
            "analyzer": {
                "ngram_tokenizer_analyzer": {
                    "type": "custom",
                    "tokenizer": "ngram_tokenizer",
                    "filter": [
                        "lowercase",
                        ...
                    ]
                }
            }
        }
    }
}

As you can see, there is a built-in lowercase filter, which simply transforms all tokens to lower case. I strongly recommend going through the documentation; there are a lot of these token filters.
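To check what the analyzer actually produces, you can hit the `_analyze` API with the analyzer name (the index name `my_index` below is a placeholder):

```json
GET /my_index/_analyze
{
    "analyzer": "ngram_tokenizer_analyzer",
    "text": "TEXT"
}
```

With the lowercase filter in place, the returned tokens should be tex, ext and text rather than their upper-case variants.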


Now the more complicated part: ngram tokenizers. Again, for a deeper understanding you might want to read the docs, but in short: your tokenizer will create terms of length 3 to 10. That means the text

I am an example TEXT.

Will basically create a lot of tokens. Just to show a few:

  • Size 3: "I a", " am", "am ", ..., "TEX", "EXT"
  • Size 4: "I am", " am ", "am a", ..., " TEX", "TEXT".
  • Size 10: "I am an ex", ...

You get the idea. (The lowercase token filter would then lowercase these tokens.)
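For intuition, the tokenizer's behaviour can be sketched in a few lines of plain Java. This is only an illustration of the sliding-window logic, not the actual Lucene tokenizer, and the class name NgramDemo is made up:

```java
import java.util.ArrayList;
import java.util.List;

public class NgramDemo {
    // Emit every character ngram of length minGram..maxGram, left to right,
    // shortest lengths first — the same set of terms an ngram tokenizer produces.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> tokens = new ArrayList<>();
        for (int len = minGram; len <= maxGram; len++) {
            for (int i = 0; i + len <= text.length(); i++) {
                tokens.add(text.substring(i, i + len));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // With min_gram=3 and max_gram=10, "exam" yields exactly three terms.
        System.out.println(ngrams("exam", 3, 10)); // prints [exa, xam, exam]
    }
}
```

Running it on exam reproduces the three-term example discussed below.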

Difference between match and term queries: match queries are analyzed, while term queries are not. In practice, that means one match query can match multiple terms. Example: you search for "exam".

This would in fact match 3 terms: exa, xam and exam.

This has influence on the score of the matches. The more matches, the higher the score. In some cases it's desired, in other cases not.

A term query is not analyzed, which means exam would match, but only one term (exam, of course). However, since it is not analyzed, it is also not lowercased, meaning you have to lowercase the input in your own code. Exam would never match, because if you use the lowercase token filter there is no term with capital letters in your index.
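To make the contrast concrete, here are the two query shapes in the query DSL (field and query text are placeholders):

```json
{ "query": { "match": { "name": "Exam" } } }

{ "query": { "term":  { "name": "Exam" } } }
```

The match variant runs Exam through the ngram analyzer first (with the lowercase filter from above, lowercasing it and splitting it into ngrams), so it finds exa, xam and exam; the term variant looks up the literal string Exam, which never exists in a lowercased index.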

I'm not sure about your use case, but I have a feeling that you could (or even want to) indeed use the term query. But be aware that there are no terms in your index longer than 10 characters, because that is what your ngram tokenizer produces.

/ EDIT:

Something worth pointing out regarding match queries, and the reason why you might want to use term queries instead: a match query for Simple will also match documents containing example, because both share the ngram mple.
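Building on that: once the wildcards are dropped, the whole boosted bool query from the question could be collapsed into a single multi_match with per-field boosts; the ^n suffixes below mirror the boosts 7..1 from the question, and the query text is a placeholder:

```json
{
    "query": {
        "multi_match": {
            "query": "some search text",
            "fields": [
                "name^7", "description^6", "modelNumber^5",
                "vendorModelNumber^4", "storeSkuId^3", "upc^2", "brand"
            ]
        }
    }
}
```

In the Java API this should correspond to QueryBuilders.multiMatchQuery(...) with per-field boosts, letting the ngram analyzer do the substring matching at index time instead of a wildcard scan at query time.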


3 Comments

Thanks a lot @Slomo for the detailed explanation. Will refine my code and go through the docs as well. :)
So suppose I need to search multiple values across multiple fields; a bool query with match queries would be a good option, right?
@DivyaMenon You can. Or you could use multi_match, where you should also be able to weight fields. A concrete query example with expected results would help to answer your question.
