2

I am using Elasticsearch to build an index of URLs.

I split each URL into three parts: "domain", "path", and "query".

For example, testing.com/index.html?user=who&pw=no is split into:

domain = testing.com
path = index.html
query = user=who&pw=no
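
A minimal sketch of that split in Python, assuming the standard-library `urllib.parse` module. The helper name `split_url` is hypothetical and only illustrates the three-part layout described above. Because the example URL has no scheme, the sketch prepends `//` so that the host is parsed as the network location rather than as part of the path.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into the three indexed parts: domain, path, query."""
    # Without a scheme or a leading "//", urlsplit treats the host as path text.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    parts = urlsplit(url)
    return {
        "domain": parts.netloc,
        "path": parts.path.lstrip("/"),  # drop the leading slash, as in the example
        "query": parts.query,
    }


print(split_url("testing.com/index.html?user=who&pw=no"))
```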

I run into problems when I try to do a partial search on these fields, for example searching for "user=who" or "ing.com".

Is it possible to use an analyzer at search time even if I didn't use one at index time?

How can I do a partial search with an analyzer?

Thank you very much.

2 Answers

6

2 approaches:

1. Wildcard search - easy to set up, but slow (a leading wildcard has to be checked against every term in the field)

"query": {
    "query_string": {
        "query": "*ing.com",
        "default_field": "domain"
    }
}

2. Use an nGram tokenizer - harder to set up, but faster at query time (the substrings are computed once, at index time)

Index Settings

"settings" : {
    "analysis" : {
        "analyzer" : {
            "my_ngram_analyzer" : {
                "tokenizer" : "my_ngram_tokenizer"
            }
        },
        "tokenizer" : {
            "my_ngram_tokenizer" : {
                "type" : "nGram",
                "min_gram" : "1",
                "max_gram" : "50"
            }
        }
    }
}

Mapping

"properties": {
    "domain": {
        "type": "string",
        "index_analyzer": "my_ngram_analyzer"
    },
    "path": {
        "type": "string",
        "index_analyzer": "my_ngram_analyzer"
    },
    "query": {
        "type": "string",
        "index_analyzer": "my_ngram_analyzer"
    }
}

Querying

"query": {
    "match": {
        "domain": "ing.com"
    }
}
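
To see why the nGram approach makes "ing.com" findable, here is a rough simulation in plain Python of what a tokenizer with min_gram 1 and max_gram 50 would emit for "testing.com". This is only a sketch: the function name `ngrams` is hypothetical, it is not Elasticsearch code, and it ignores tokenizer options such as token_chars. It also assumes the search term reaches the index as a single token.

```python
def ngrams(text, min_gram=1, max_gram=50):
    """Return every substring of text with a length from min_gram to max_gram."""
    grams = set()
    for start in range(len(text)):
        longest = min(max_gram, len(text) - start)
        for size in range(min_gram, longest + 1):
            grams.add(text[start:start + size])
    return grams


indexed = ngrams("testing.com")
print("ing.com" in indexed)  # True: the search term is one of the indexed tokens
print("ing.org" in indexed)  # False: no such substring was indexed
```

The trade-off is index size: every substring becomes a token, so a large max_gram on long values can inflate the index considerably.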

2 Comments

If I didn't use an analyzer at index time, can I use one at search time?
"Analyzer" is not a single thing; there are many kinds. In short, yes: you can use a different analyzer at search time from the one used at index time. Analysis takes the raw data and produces tokens. When you search, the text in the query is analyzed, and the resulting query tokens are matched against the tokens created at index time. The results therefore depend on both the index and the search analyzers. Some searches cannot be done with search-time analysis alone; for those you need to set an explicit mapping.
-1

The trick with the query string is to split a string like "user=who&pw=no" into the tokens ["user=who&pw=no", "user=who", "pw=no"] at index time. That makes queries like "user=who" easy. You could do this with a pattern_capture token filter, but there may be better ways to do it as well.
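
As a rough illustration of the token list described above, the following Python sketch produces the same tokens for a query string. It only simulates the intended output; it is not a pattern_capture configuration, and the function name `query_tokens` is hypothetical.

```python
def query_tokens(query):
    """Return the full query string followed by each key=value pair."""
    tokens = [query]
    for pair in query.split("&"):
        # Skip empty pieces (e.g. from "a=1&&b=2") and avoid duplicate tokens.
        if pair and pair not in tokens:
            tokens.append(pair)
    return tokens


print(query_tokens("user=who&pw=no"))
# ['user=who&pw=no', 'user=who', 'pw=no']
```

With these tokens in the index, an exact term lookup for "user=who" matches the document without any wildcard.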

You can also make the hostname and path more searchable with the path_hierarchy tokenizer; for example, "/some/path/somewhere" becomes ["/some", "/some/path", "/some/path/somewhere"]. You can index the hostname with the path_hierarchy tokenizer as well, by setting reverse: true and delimiter: ".". You may also want to use a stopword filter to exclude top-level domains.
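
The sketch below approximates in plain Python what a path_hierarchy-style tokenizer emits, both for a path and for a hostname with reverse and a "." delimiter. The function name `path_hierarchy` is hypothetical, and the real tokenizer may treat edge cases such as trailing delimiters differently.

```python
def path_hierarchy(text, delimiter="/", reverse=False):
    """Approximate the tokens of a path_hierarchy-style tokenizer."""
    parts = text.split(delimiter)
    if reverse:
        # Drop leading components one at a time (suffixes of the input).
        tokens = [delimiter.join(parts[i:]) for i in range(len(parts))]
    else:
        # Grow the path one component at a time (prefixes of the input).
        tokens = [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]
    return [token for token in tokens if token]


print(path_hierarchy("/some/path/somewhere"))
# ['/some', '/some/path', '/some/path/somewhere']
print(path_hierarchy("www.testing.com", delimiter=".", reverse=True))
# ['www.testing.com', 'testing.com', 'com']
```

The last token of the hostname example, "com", is the kind of top-level-domain token a stopword filter would remove.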
