Using Lucene, certain queries parse and execute in a completely unexpected way.

Here's the code I use to test it (written in Scala, but it can easily be translated to Java):

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document._
import org.apache.lucene.index._
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.search.{IndexSearcher, Query}
import org.apache.lucene.store._
import scala.jdk.CollectionConverters._

object LuceneTestUtils {

  def testDocuments(
    docs: Seq[Map[String, String]],
    query: String,
    expected: Seq[Map[String, String]]
  ): Unit = {
    withIndex(docs) { searcher =>
      val analyzer = new StandardAnalyzer()
      val parser = new QueryParser("defaultField", analyzer)
      val q: Query = parser.parse(query)
      println(q)
      val hits = searcher.search(q, 1000).scoreDocs
      val results = hits.map { hit =>
        val doc = searcher.doc(hit.doc)
        doc.getFields.asScala.map(f => f.name() -> doc.get(f.name())).toMap
      }
      .toList

      val notExpected = results.diff(expected)
      assert(notExpected.isEmpty, s"Got unexpected documents:\n${notExpected.mkString("\n")}\nReceived documents:\n${results.mkString("\n")}")

      val missing = expected.diff(results)
      assert(missing.isEmpty, s"Missing expected documents: \n${missing.mkString("\n")}\nReceived documents:\n${results.mkString("\n")}")
    }
  }

  def withIndex(
      docs: Seq[Map[String, String]]
  )(test: IndexSearcher => Unit): Unit = {
    val analyzer = new StandardAnalyzer()
    val index: Directory =
      new org.apache.lucene.store.ByteBuffersDirectory(NoLockFactory.INSTANCE)
    val config = new IndexWriterConfig(analyzer)
    val writer = new IndexWriter(index, config)

    docs.foreach { fields =>
      val doc = new Document()
      fields.foreach { case (k, v) =>
        doc.add(new StringField(k, v, Field.Store.YES))
      }
      writer.addDocument(doc)
    }
    writer.close()

    val reader = DirectoryReader.open(index)
    val searcher = new IndexSearcher(reader)
    try {
      test(searcher)
    } finally {
      reader.close()
      index.close()
    }
  }
}

Dependencies used:

  "org.apache.lucene" % "lucene-core" % "9.9.2",
  "org.apache.lucene" % "lucene-queryparser" % "9.9.2",

I generate some test data:

  private val testData =  for {
      key1Value <- Seq(Some("value1"), Some("value2"), None)
      key2Value <- Seq(Some("value1"), Some("value2"), None)
      key3Value <- Seq(Some("value1"), Some("value2"), None)
      key4Value <- Seq(Some("value1"), Some("value2"), None)
    } yield Seq(
      key1Value.map("key1" -> _),
      key2Value.map("key2" -> _),
      key3Value.map("key3" -> _),
      key4Value.map("key4" -> _)
    ).flatten.toMap

Most queries work exactly as expected, e.g.:

1.

    val expectedDocs = testData
      .filter(doc =>
        (doc.get("key1").contains("value2") || doc.get("key2").contains("value2")) &&
        (doc.get("key3").contains("value2") || doc.get("key4").contains("value2"))
      )

    LuceneTestUtils.testDocuments(testData, "(key1:value2 OR key2:value2) AND (key3:value2 OR key4:value2)", expectedDocs)

The following test cases show very unexpected behaviour:

2.

    val expectedDocs = testData
      .filter(doc =>
        doc.get("key1").contains("value1") &&
        doc.get("key2").contains("value1") &&
        // This feels very wrong: the && joining the key2 and key3 terms should be ||
        doc.get("key3").contains("value1") &&
        doc.get("key4").contains("value1")
      )

    LuceneTestUtils.testDocuments(testData, "key1:value1 AND key2:value1 OR key3:value1 AND key4:value1", expectedDocs)

3.

    val expectedDocs = testData
      .filter(doc =>
        // ??? The key1 and key4 terms have no effect on which documents match
        // doc.get("key1").contains("value2") || (
        doc.get("key2").contains("value2") &&
          doc.get("key3").contains("value2")
        // ) || doc.get("key4").contains("value2")
      )

    LuceneTestUtils.testDocuments(testData, "key1:value2 OR key2:value2 AND key3:value2 OR key4:value2", expectedDocs)

No operator precedence rule that I can think of explains both results. I considered:

  • AND before OR
  • AND equal to OR
    • with either left-to-right or right-to-left associativity
  • AND after OR

A hint at the problem is the output of the println(q) call for each of the three queries:

1.

+(key1:value2 key2:value2) +(key3:value2 key4:value2)

2.

+key1:value1 +key2:value1 +key3:value1 +key4:value1

3.

key1:value2 +key2:value2 +key3:value2 key4:value2

These parsed queries match the observed behaviour, but they don't make sense to me. Why does Lucene parse the queries this way?

  • I don't have a precise answer, but the Classic QueryParser you're using is documented as intended for simple queries written by end users (link). Did you try another parser out of curiosity? Commented Nov 5 at 18:38
  • Related: Lucene operator precedence for boolean operators. There is also a great thread (old but still highly relevant) from the Lucene mailing list: Getting a Better Understanding of Lucene's Search Operators. Commented Nov 5 at 21:22
  • Bottom line: these "boolean" operators AND and OR were added to Lucene as a way to try to be more intuitive than + (meaning "must exist") along with "may exist" and "must not exist". These are the fundamental Lucene operators here. We have to remember that Lucene's overall objective is not a pure boolean "hit" or "miss" process, but rather a scoring process (more relevant vs. less relevant matches, ranked in order). Commented Nov 5 at 21:22
  • This question is similar to: Lucene operator precedence for boolean operators. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. Commented Nov 6 at 2:22
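
To make the point from the comments concrete, here is a minimal sketch of how a flat chain of AND/OR terms could turn into "must exist" / "may exist" clauses. It is a simplified model, not Lucene's own code: ClauseModel, flatten, render and matches are hypothetical names, and the rules are an assumption based on the printed queries above (default operator OR, no parentheses, no NOT, + or - modifiers). The idea is that AND marks the term on each side of it as required, while OR only adds an optional term and leaves its left neighbour as it was.

```scala
// Simplified, hypothetical model of how a flat AND/OR chain becomes
// required ("+") and optional clauses. Not Lucene's actual implementation.
object ClauseModel {
  sealed trait Occur { def prefix: String }
  case object Must extends Occur { val prefix = "+" }   // "must exist"
  case object Should extends Occur { val prefix = "" }  // "may exist"

  sealed trait Conj
  case object And extends Conj
  case object Or extends Conj

  final case class Clause(term: String, occur: Occur)

  // `first` is the leading term; `rest` holds (operator, term) pairs in query order.
  def flatten(first: String, rest: Seq[(Conj, String)]): Vector[Clause] =
    rest.foldLeft(Vector(Clause(first, Should))) {
      case (clauses, (And, term)) =>
        // AND makes the term on its left and the new term required.
        clauses.init :+ clauses.last.copy(occur = Must) :+ Clause(term, Must)
      case (clauses, (Or, term)) =>
        // OR leaves the left neighbour untouched and adds an optional term.
        clauses :+ Clause(term, Should)
    }

  // Prints clauses in the same style as the println(q) output in the question.
  def render(clauses: Seq[Clause]): String =
    clauses.map(c => c.occur.prefix + c.term).mkString(" ")

  // A document (modelled as the set of terms it contains) matches when every
  // required clause is present; optional clauses only decide the match when
  // there are no required clauses at all.
  def matches(clauses: Seq[Clause], doc: Set[String]): Boolean = {
    val (must, should) = clauses.partition(_.occur == Must)
    if (must.nonEmpty) must.forall(c => doc(c.term))
    else should.exists(c => doc(c.term))
  }

  def main(args: Array[String]): Unit = {
    // Query 2: key1:value1 AND key2:value1 OR key3:value1 AND key4:value1
    println(render(flatten("key1:value1",
      Seq((And, "key2:value1"), (Or, "key3:value1"), (And, "key4:value1")))))
    // prints: +key1:value1 +key2:value1 +key3:value1 +key4:value1

    // Query 3: key1:value2 OR key2:value2 AND key3:value2 OR key4:value2
    println(render(flatten("key1:value2",
      Seq((Or, "key2:value2"), (And, "key3:value2"), (Or, "key4:value2")))))
    // prints: key1:value2 +key2:value2 +key3:value2 key4:value2
  }
}
```

Under this model there is no precedence at all: each operator only affects the clauses right next to it, and once any clause is required, the optional ones influence ranking rather than matching. That would account for query 3 matching exactly the documents where key2 and key3 are both value2, and it suggests that explicit parentheses (as in query 1) are the way to get the grouping you intend.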
