
Update (+18 days): edited the title and provided an answer addressing the original question.


tl;dr

I am indexing HTML pages and dumping the <p>...</p> content as a snippet for search query returns. However, I don't want / need all that content (just the context around the query matched text).

Background

With these in my [classic] schema,

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" 
autoGeneratePhraseQueries="true" multiValued="true">

<field name="p" type="text_general" indexed="true" stored="true" multiValued="true" 
omitNorms="true" termVectors="true" />

and these in my solrconfig.xml

<str name="queryAnalyzerFieldType">text_general</str>

<updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields">
  <lst name="typeMapping">
    <str name="valueClass">java.lang.String</str>
    <str name="fieldType">text_general</str>
    <lst name="copyField">
      <str name="dest">*_str</str>
      <int name="maxChars">256</int>
    </lst>
    ...

<initParams path="/update/**,/query,/select,/spell">
  <lst name="defaults">
    <str name="df">_text_</str>
  </lst>
</initParams>

<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="capture">div</str>
    <str name="fmap.div">div</str>
    <str name="capture">p</str>
    <str name="fmap.p">p</str>
    <str name="processor">uuid,remove-blank,field-name-mutating,parse-boolean,
               parse-long,parse-double,parse-date</str>
  </lst>
</requestHandler>

<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>

<queryResponseWriter name="json" class="solr.JSONResponseWriter">
  <!-- For the purposes of the tutorial, JSON responses are written as
   plain text so that they are easy to read in *any* browser.
   If you expect a MIME type of "application/json" just remove this override.
  -->
  <str name="content-type">text/plain; charset=UTF-8</str>
</queryResponseWriter>


I get this result [Solr Admin UI; facsimile shown here],

"p":["Sentence 1. Sentence 2. Sentence 3. Sentence 4. ..."]

In the source HTML document those sentences occur singly in p-tags, e.g. <p>Sentence 1.</p>, <p>Sentence 2.</p>, ...

Questions

  1. How can I index them, singly? My rationale is that I want to display a snippet of the context around the search result target (not the entire p-tagged content).

  2. Additionally, the Linux grep command can return lines before and after a matched line (the -C1 context argument). Can we do something similar here?

    I.e., if the Solr query match is in Sentence 2, would the snippet contain Sentences 1-3?

I tried assigning unique ids to the p-elements (<p id="a">...</p> <p id="b">...</p>), but I just got this in Solr,

"p":["a Sentence 1. b Sentence 2. Sentence d 3. Sentence d 4. ..."]
  • Have you looked at lucene.apache.org/solr/guide/8_7/highlighting.html ? Commented Dec 13, 2020 at 14:44
  • @MatsLindh: thank you for the suggestion; coincidentally I started looking at that before I went to bed; I think it looks promising! :-) Commented Dec 13, 2020 at 17:05

1 Answer

Update [2020-12-31]

  • Please overlook the answering of my own question, as 18 days have passed with one comment and no answers.

I am building a search page with Solr as the backend, inspired by the Ajax Solr tutorial: https://github.com/evolvingweb/ajax-solr

Ultimately, I decided to forgo Solr highlighting in favor of a more flexible, bespoke JavaScript (JS) solution.

Basically, I:

  • collect the Solr query (q) and filter query (fq) values (terms) in an array (the complete JS code is in the Addendum), then loop over the returned documents (simplified example shown below)

    for (var i = 0, l = this.manager.response.response.docs.length; i < l; i++) {
        var doc = this.manager.response.response.docs[i];
    }
    
  • split the document text into sentences via a JS regular expression, so that the sentences matching those terms (words) can be extracted

    var mySentences = doc.p.toString().replace(/([.?!])\s*(?=['"A-Z])/g, "$1|").split("|");
    

    where doc.p is a Solr field (defined in schema.xml) corresponding to the indexed HTML p-element (<p>...</p>) text.

  • highlight those query terms

    var query = this.manager.store.get('q').value;  /* or loop over array */
    
    const replacer = (str, replace) => {
        const re = new RegExp(`(${replace})`, 'gi')
        return str.replaceAll(re, '<font style="background:#FFFF99">$1</font>')
    }
    var doc_p_hl = replacer(doc.p.toString(), query);
    
  • use those term-highlighted strings as snippets on the frontend

  • apply a similar approach to the highlighting of query terms in the full documents, doc.p.toString() ...
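
The sentence-splitting and highlighting steps above can be combined into a snippet with surrounding context, roughly what grep -C1 does for lines. The following is only a minimal sketch: the helper names (splitSentences, escapeRegExp, snippetWithContext, highlightTerms) are hypothetical and not part of Ajax Solr, and it assumes the field text has already been converted to a plain string. It also escapes regex metacharacters in the terms, which the simplified code above does not do.

```javascript
// Hypothetical helpers for illustration; none of these names come from Ajax Solr.

// Split text into sentences with the same regex used in the steps above.
function splitSentences(text) {
    return text.replace(/([.?!])\s*(?=['"A-Z])/g, "$1|").split("|");
}

// Escape regex metacharacters so a term such as "C++" is matched literally.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return every sentence containing a term, plus `context` sentences on each side.
function snippetWithContext(text, terms, context) {
    if (terms.length === 0) return [];
    var sentences = splitSentences(text);
    var re = new RegExp(terms.map(escapeRegExp).join("|"), "i");
    var keep = new Set();
    sentences.forEach(function (sentence, i) {
        if (re.test(sentence)) {
            var lo = Math.max(0, i - context);
            var hi = Math.min(sentences.length - 1, i + context);
            for (var j = lo; j <= hi; j++) keep.add(j);
        }
    });
    return sentences.filter(function (_, i) { return keep.has(i); });
}

// Wrap each occurrence of any term in the same highlight markup as above.
function highlightTerms(text, terms) {
    if (terms.length === 0) return text;
    var re = new RegExp("(" + terms.map(escapeRegExp).join("|") + ")", "gi");
    return text.replace(re, '<font style="background:#FFFF99">$1</font>');
}
```

Under these assumptions, a frontend snippet could be built with something like highlightTerms(snippetWithContext(doc.p.toString(), highlight_arr, 1).join(" "), highlight_arr), where a context of 1 keeps one sentence before and after each match.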


Addendum

Here is the JS code I wrote to collect Solr "q" and "fq" terms in an array. Note that Solr returns a single fq as a string, and multiple fq terms as an array.

var q_arr = [];
var fq_arr = [];
var highlight_arr = [];
var snippets_arr = [];
var fq_vals = [];

if ((this.manager.store.get('q').value !== undefined) &&
    (this.manager.store.get('q').value !== '*:*')) {
    /* declare query locally to avoid creating an implicit global */
    var query = this.manager.store.get('q').value;
    q_arr.push(query);
    highlight_arr.push(query);
    console.log('q_arr:', q_arr, '| type:', typeof q_arr, '| length:', q_arr.length);
}

var doc_responseHeader = this.manager.response.responseHeader;
if (doc_responseHeader.params.fq !== undefined) {

    /* ONE "fq" (FILTER QUERY) TERM: */
    if (typeof doc_responseHeader.params.fq === 'string' ||
        doc_responseHeader.params.fq instanceof String) {
        fq_arr.push(doc_responseHeader.params.fq);
    }

    /* MORE THAN ONE "fq" (FILTER QUERY) TERM: */
    if  (typeof doc_responseHeader.params.fq === 'object' ||
        doc_responseHeader.params.fq instanceof Object) {

        for (var i = 0, l = doc_responseHeader.params.fq.length; i < l; i++) {
            fq_arr.push(doc_responseHeader.params.fq[i].toString());
        }
    }

    fq_vals = fq_arr.map(function(x){return x.replace(/keywords:/g, '');})
    console.log('fq_vals', fq_vals, '| type:', typeof fq_vals, '| length:', fq_vals.length)

    for (var i = 0, l = fq_vals.length; i < l; i++) {
        highlight_arr.push(fq_vals[i].toString());
    }
}
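
The string-versus-array handling above could also be folded into one small normalizing helper. This is a sketch only: fqToArray and fqValues are hypothetical names, and unlike the replace(/keywords:/g, '') call above, this version strips the field prefix only when it appears at the start of each value.

```javascript
// Hypothetical helper: normalize the echoed "fq" parameter to an array of strings.
// Handles the three cases seen above: missing, a single string, or an array.
function fqToArray(fq) {
    if (fq === undefined || fq === null) return [];
    return Array.isArray(fq) ? fq.map(String) : [String(fq)];
}

// Hypothetical helper: strip a leading "field:" prefix from each fq value.
function fqValues(fq, field) {
    var prefix = field + ":";
    return fqToArray(fq).map(function (x) {
        return x.indexOf(prefix) === 0 ? x.slice(prefix.length) : x;
    });
}
```

With this, the loop that fills highlight_arr could be written as highlight_arr = highlight_arr.concat(fqValues(doc_responseHeader.params.fq, 'keywords')).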