Understanding boosting in ElasticSearch

Question

I've been using ElasticSearch for a little bit with the goal of building a search engine and I'm interested in manually changing the IDFs (Inverse Document Frequencies) of each term to match the ones one can measure from the Google Books unigrams.

In order to do that I plan on doing the following:

1) Use only 1 shard (so IDFs are not computed for every shard and they are "global")

2) Get the ttf (total term frequency, which is used to compute the IDFs) for every term by running this query for every document in my index

curl -XGET 'http://localhost:9200/index/document/id_doc/_termvectors?pretty=true' -d '{
  "fields" : ["content"],
  "offsets" : true,
  "term_statistics" : true
}'

3) Use the Google Books unigram model to "rescale" the ttf for every term.
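The three steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive method: it assumes the classic Lucene TF/IDF similarity, whose IDF is computed from doc_freq (which term_statistics returns alongside ttf) rather than from ttf. The response fragment, the target_idf values and the num_docs figure are hypothetical placeholders for data you would derive from your index and from the Google Books unigrams.

```python
import math

def classic_idf(doc_freq, num_docs):
    """Classic Lucene TF/IDF: idf = 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def term_stats(termvectors_response, field="content"):
    """Extract {term: (doc_freq, ttf)} from one _termvectors response body."""
    terms = termvectors_response["term_vectors"][field]["terms"]
    return {term: (s["doc_freq"], s["ttf"]) for term, s in terms.items()}

def rescale_factors(stats, num_docs, target_idf):
    """Ratio of the desired IDF to the index IDF, for terms present in both."""
    return {
        term: target_idf[term] / classic_idf(doc_freq, num_docs)
        for term, (doc_freq, _ttf) in stats.items()
        if term in target_idf
    }

# Hypothetical fragment of a _termvectors response (term_statistics enabled).
response = {"term_vectors": {"content": {"terms": {
    "cat": {"doc_freq": 9, "ttf": 30, "term_freq": 2},
    "the": {"doc_freq": 99, "ttf": 900, "term_freq": 5},
}}}}

# Hypothetical target IDFs, e.g. derived from Google Books unigram counts.
target_idf = {"cat": 4.0, "the": 1.0}

factors = rescale_factors(term_stats(response), num_docs=100, target_idf=target_idf)
print(factors)
```

How such a ratio should be turned into a query-time boost depends on the similarity in use, so treat the factor as a starting point to verify against the explain output.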

The problem is that, once I've found the "boost" factors I have to use for every term, how can I use this in a query?

For instance, let's consider this example

"query":
{  
    "bool":{  
        "should":[  
            {  
                "match":{  
                    "title":{  
                        "query":"cat",
                        "boost":2
                    }
                }
            },
            {  
                "match":{  
                    "content":{  
                        "query":"cat",
                        "boost":2
                    }
                }
            }
        ]
    }
}

Does that mean that the IDF of the term "cat" is going to be boosted / multiplied by a factor of 2?

Also, what happens if, instead of searching for one word, I search for a sentence? Would that mean that the IDF of each word is going to be boosted by 2?

I tried to understand the role of the boost parameter (https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html) and t.getBoost(), but that seems a little confusing.

chengpohi · Accepted Answer · 2016-08-02 16:18:28Z

Boost matters when a query combines several clauses, for example:

{  
    "bool":{  
        "should":[  
            {  
                "match":{  
                    "clause1":{  
                        "query":"query1",
                        "boost":3
                    }
                }
            },
            {  
                "match":{  
                    "clause2":{  
                        "query":"query2",
                        "boost":2
                    }
                }
            },
            {  
                "match":{  
                    "clause3":{  
                        "query":"query1",
                        "boost":1
                    }
                }
            }
        ]
    }
}

In the above query, clause1 is three times as important as clause3, and clause2 is twice as important as clause3. The boost does not simply multiply the final score by 3 or 2, because scores are normalized when they are calculated.
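For reference, here is a sketch of the practical scoring function as described in the guide linked in the question. It applies to the classic TF/IDF similarity (newer Elasticsearch versions default to BM25, where the formula differs), and boost(t) stands for what the guide calls t.getBoost():

```latex
\mathrm{score}(q,d) = \mathrm{queryNorm}(q)\cdot\mathrm{coord}(q,d)\cdot
  \sum_{t \in q}\Big(\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)\Big)
```

Read this way, the boost multiplies each term's contribution to the sum rather than changing the IDF itself, and queryNorm is the normalization step that keeps the final score from being a plain multiple of the boost.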

Also, if you query with only a single clause, boosting it is not useful.
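A toy illustration of why a boost on a lone clause has no visible effect: scaling every score by the same positive factor leaves the ranking unchanged. The scores below are hypothetical and the function only mimics the multiplication, not Elasticsearch's actual scoring.

```python
def rank(scores, boost=1.0):
    """Return document ids ordered by boosted score, highest first."""
    boosted = {doc: score * boost for doc, score in scores.items()}
    return sorted(boosted, key=boosted.get, reverse=True)

# Hypothetical scores for a single-clause query.
scores = {"doc_a": 0.8, "doc_b": 1.7, "doc_c": 0.3}

unboosted = rank(scores, boost=1.0)
boosted = rank(scores, boost=2.0)
print(unboosted == boosted)
```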

A usage scenario for boost:

A set of page documents, each with a title field and a content field.

You want to search both title and content for some terms, and you consider a title match more important than a content match. In that case, give the title clause a higher boost than the content clause. For example, if one document matches on its title field and another matches on its content field, and you want the title match to rank above the content match, boost helps you do that.
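The title-versus-content scenario could be built like this. It is a minimal sketch that only assembles the request body; the helper name boosted_match_query and the boost values 2 and 1 are illustrative assumptions, and sending the body to Elasticsearch is left out.

```python
import json

def boosted_match_query(text, field_boosts):
    """Build a bool/should query with one match clause per field and its boost."""
    should = [
        {"match": {field: {"query": text, "boost": boost}}}
        for field, boost in field_boosts.items()
    ]
    return {"query": {"bool": {"should": should}}}

# Title matches count more than content matches (illustrative values).
body = boosted_match_query("cat", {"title": 2, "content": 1})
print(json.dumps(body, indent=2))
```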

edited Aug 2, 2016 at 16:18

answered Aug 2, 2016 at 9:59


5 Comments

Brian Over a year ago

Thanks for the answer! I was wondering, however, if there's a way to boost specific words. The ElasticSearch guide mentions t.getBoost() for that, but I don't understand how it can be used in practice.

chengpohi Over a year ago

@Brian, t.getBoost() means that when we set a boost in the query, the scoring function reads that boost through the method t.getBoost(). The boost is for increasing the query clause's weight.

Brian Over a year ago

ok, thanks! but how exactly is "boost" used in the score? This is the description of the score in Lucene (elastic.co/guide/en/elasticsearch/guide/current/…), but it's not clear to me what happens when there's more than one term in the query.

chengpohi Over a year ago

t.getBoost() is used when calculating the document score (like the tf and idf factors). Do you mean a multi-term query or a query with multiple clauses?

Brian Over a year ago

I mean a multi-term query. What would the boost parameter do in that case? Is that multiplying the tf * idf factor in the scoring formula?
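For the per-word boosting raised in these comments, one possible approach is to give each word its own term clause with its own boost inside a bool query. This is a sketch under assumptions: the helper name per_term_query and the boost values are hypothetical, and because term queries are not analyzed, the words must already match the indexed tokens (for example, lowercased).

```python
def per_term_query(field, term_boosts):
    """Build a bool/should query with one boosted term clause per word."""
    should = [
        {"term": {field: {"value": term, "boost": boost}}}
        for term, boost in term_boosts.items()
    ]
    return {"query": {"bool": {"should": should}}}

# Hypothetical per-word boosts for the sentence "black cat".
body = per_term_query("content", {"black": 1.2, "cat": 2.0})
print(body)
```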
