
I've been using Elasticsearch for a little while with the goal of building a search engine, and I'm interested in manually changing the IDF (inverse document frequency) of each term to match the values one can measure from the Google Books unigrams.

In order to do that I plan on doing the following:

1) Use only one shard, so that IDFs are computed globally rather than separately for every shard.

2) Get the ttf (total term frequency, which is used to compute the IDFs) for every term by running this query for every document in my index:

curl -XGET 'http://localhost:9200/index/document/id_doc/_termvectors?pretty=true' -d '{
  "fields" : ["content"],
  "offsets" : true,
  "term_statistics" : true
}'

3) Use the Google Books unigram model to "rescale" the ttf for every term.
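As a rough sketch of steps 2 and 3, the per-term statistics can be pulled out of a `_termvectors` response and turned into an IDF value. The `sample_response` below is made-up data that only mirrors the shape of a response requested with `term_statistics: true`; the helper names are hypothetical, and the IDF formula is the classic Lucene TF/IDF one (1 + ln(numDocs / (docFreq + 1))), which may not match the similarity your index actually uses. Note that this formula is driven by `doc_freq`, which is returned alongside `ttf`.

```python
import math


def term_statistics(response, field):
    """Extract doc_freq and ttf per term from a _termvectors response dict."""
    terms = response["term_vectors"][field]["terms"]
    return {
        term: {"doc_freq": stats["doc_freq"], "ttf": stats["ttf"]}
        for term, stats in terms.items()
    }


def classic_idf(doc_freq, num_docs):
    """Classic Lucene TF/IDF inverse document frequency (an assumption here)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


# Made-up response fragment, shaped like the output of the curl call above.
sample_response = {
    "term_vectors": {
        "content": {
            "field_statistics": {"doc_count": 1000, "sum_doc_freq": 52000, "sum_ttf": 81000},
            "terms": {
                "cat": {"doc_freq": 9, "ttf": 40, "term_freq": 2},
                "the": {"doc_freq": 999, "ttf": 7000, "term_freq": 5},
            },
        }
    }
}

stats = term_statistics(sample_response, "content")
# doc_count is used as the collection size; this is an approximation.
num_docs = sample_response["term_vectors"]["content"]["field_statistics"]["doc_count"]
idf = {term: classic_idf(s["doc_freq"], num_docs) for term, s in stats.items()}
```

Merging the `stats` dictionaries across all documents would give one table of index-side statistics to compare against the Google Books counts.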

The problem is this: once I've found the "boost" factor I need for each term, how can I apply it in a query?

For instance, let's consider this example:

{
    "query":{
        "bool":{
            "should":[
                {
                    "match":{
                        "title":{
                            "query":"cat",
                            "boost":2
                        }
                    }
                },
                {
                    "match":{
                        "content":{
                            "query":"cat",
                            "boost":2
                        }
                    }
                }
            ]
        }
    }
}

Does that mean that the IDF of the term "cat" is going to be multiplied by a factor of 2?

Also, what happens if, instead of searching for one word, I search for a sentence? Would the IDF of each word be boosted by 2?

I tried to understand the role of the boost parameter (https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html) and of t.getBoost(), but I still find it a little confusing.

1 Answer


The boost is used when a query has multiple query clauses, for example:

{  
    "bool":{  
        "should":[  
            {  
                "match":{  
                    "clause1":{  
                        "query":"query1",
                        "boost":3
                    }
                }
            },
            {  
                "match":{  
                    "clause2":{  
                        "query":"query2",
                        "boost":2
                    }
                }
            },
            {  
                "match":{  
                    "clause3":{  
                        "query":"query1",
                        "boost":1
                    }
                }
            }
        ]
    }
}

In the above query, clause1 is three times as important as clause3, and clause2 is twice as important as clause3. The scores are not simply multiplied by 3 and 2, though, because the score is normalized when it is calculated.
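To make the normalization concrete, this is the practical scoring function as described in the Elasticsearch guide for the classic TF/IDF similarity, written out as a formula. The expansion of queryNorm in terms of per-term boosts follows Lucene's TFIDFSimilarity and is an assumption about the similarity in use; other similarities (for example BM25) score differently.

```latex
\[
\mathrm{score}(q,d) = \mathrm{queryNorm}(q)\cdot\mathrm{coord}(q,d)\cdot
\sum_{t \in q}\Big(\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot t.\mathrm{getBoost}()\cdot\mathrm{norm}(t,d)\Big)
\]
\[
\mathrm{queryNorm}(q) = \frac{1}{\sqrt{\sum_{t \in q}\big(\mathrm{idf}(t)\cdot t.\mathrm{getBoost}()\big)^{2}}}
\]
```

Under this formula the boost multiplies each term's tf · idf² contribution, and the same boost also feeds into queryNorm, which is why the final score does not scale by exactly the boost value. Since idf appears squared, replacing an index-side idf(t) by a target value idf'(t) would, under these assumptions, call for a per-term boost of roughly (idf'(t) / idf(t))².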

Also, if you query with only a single query clause, putting a boost on it is not useful.
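A small numeric sketch of why that is, assuming the classic TF/IDF similarity where each clause's query-side weight is idf · boost · queryNorm and queryNorm is 1 / sqrt of the sum of squared (idf · boost) values. The function name and the IDF value 2.5 are made up for illustration.

```python
import math


def normalized_query_weights(clauses):
    """Return idf * boost * queryNorm for each (idf, boost) clause."""
    sum_of_squared_weights = sum((idf * boost) ** 2 for idf, boost in clauses)
    query_norm = 1.0 / math.sqrt(sum_of_squared_weights)
    return [idf * boost * query_norm for idf, boost in clauses]


# One clause: the boost cancels out in the normalization.
single_plain = normalized_query_weights([(2.5, 1.0)])
single_boosted = normalized_query_weights([(2.5, 2.0)])

# Two clauses with equal IDF: only the ratio between the boosts survives.
multi = normalized_query_weights([(2.5, 3.0), (2.5, 1.0)])
ratio = multi[0] / multi[1]
```

With a single clause both weights come out identical, so the boost has no effect on the score. With two clauses the first keeps three times the weight of the second, even though neither weight equals its raw boost.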

A usage scenario for boost:

You have a set of page documents with title and content fields.

You want to search both title and content for some terms, and you consider a match in the title more important than a match in the content. In that case you can give the title clause a higher boost than the content clause. For example, if your query matches one document through its title field and another through its content field, the higher boost makes the title match rank ahead of the content match.
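The same idea can be pushed down to individual words. A minimal sketch, assuming one `term` query per (field, word) pair inside a `bool`/`should`, each carrying its own boost: `build_boosted_query` is a hypothetical helper, the boost values are made up, and because `term` queries are not analyzed the words must already be in their indexed form (for example lowercased).

```python
def build_boosted_query(term_boosts, field_boosts):
    """Build a bool/should query with one boosted term clause per field and word."""
    should = []
    for field, field_boost in field_boosts.items():
        for term, term_boost in term_boosts.items():
            should.append({
                "term": {
                    field: {
                        "value": term,
                        # Combined weight: field importance times per-word factor.
                        "boost": field_boost * term_boost,
                    }
                }
            })
    return {"query": {"bool": {"should": should}}}


# Hypothetical factors: title counts double, "cat" is weighted above "sat".
query = build_boosted_query(
    term_boosts={"cat": 1.5, "sat": 0.5},
    field_boosts={"title": 2.0, "content": 1.0},
)
```

The resulting dictionary can be serialized with `json.dumps` and sent as the body of a `_search` request.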


5 Comments

Thanks for the answer! I was wondering, however, whether there's a way to boost specific words. The Elasticsearch guide mentions t.getBoost() for that, but I don't understand how it can be used in practice.
@Brian, t.getBoost() means that when we set a boost in the query, the scoring function reads that boost through the method t.getBoost(). The boost is there to increase the weight of the query clause.
OK, thanks! But how exactly is "boost" used in the score? This is the description of the score in Lucene (elastic.co/guide/en/elasticsearch/guide/current/…), but it's not clear to me what happens when there's more than one term in the query.
t.getBoost() is used when calculating the document score (like the tf and idf factors). Do you mean a multi-term query or multiple query clauses?
I mean a multi-term query. What would the boost parameter do in that case? Does it multiply the tf * idf factor in the scoring formula?
