Using Lucene, certain queries parse and execute in a completely unexpected way.
Here's the code for testing it (written in Scala, but can be easily translated to Java too):
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document._
import org.apache.lucene.index._
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.search.{IndexSearcher, Query}
import org.apache.lucene.store._
import scala.jdk.CollectionConverters._
object LuceneTestUtils {

  def testDocuments(
      docs: Seq[Map[String, String]],
      query: String,
      expected: Seq[Map[String, String]]
  ): Unit = {
    withIndex(docs) { searcher =>
      val analyzer = new StandardAnalyzer()
      val parser = new QueryParser("defaultField", analyzer)
      val q: Query = parser.parse(query)
      println(q)

      val hits = searcher.search(q, 1000).scoreDocs
      val results = hits.map { hit =>
        val doc = searcher.doc(hit.doc)
        doc.getFields.asScala.map(f => f.name() -> doc.get(f.name())).toMap
      }.toList

      val notExpected = results.diff(expected)
      assert(
        notExpected.isEmpty,
        s"Got unexpected documents:\n${notExpected.mkString("\n")}\nReceived documents:\n${results.mkString("\n")}"
      )
      val missing = expected.diff(results)
      assert(
        missing.isEmpty,
        s"Missing expected documents:\n${missing.mkString("\n")}\nReceived documents:\n${results.mkString("\n")}"
      )
    }
  }

  def withIndex(
      docs: Seq[Map[String, String]]
  )(test: IndexSearcher => Unit): Unit = {
    val analyzer = new StandardAnalyzer()
    // In-memory index, built fresh for each call
    val index: Directory = new ByteBuffersDirectory(NoLockFactory.INSTANCE)
    val config = new IndexWriterConfig(analyzer)
    val writer = new IndexWriter(index, config)

    docs.foreach { fields =>
      val doc = new Document()
      fields.foreach { case (k, v) =>
        // StringField: stored, indexed as a single untokenized term
        doc.add(new StringField(k, v, Field.Store.YES))
      }
      writer.addDocument(doc)
    }
    writer.close()

    val reader = DirectoryReader.open(index)
    val searcher = new IndexSearcher(reader)
    try {
      test(searcher)
    } finally {
      reader.close()
      index.close()
    }
  }
}
Dependencies used:
"org.apache.lucene" % "lucene-core" % "9.9.2",
"org.apache.lucene" % "lucene-queryparser" % "9.9.2",
I generate some test data:
private val testData = for {
  key1Value <- Seq(Some("value1"), Some("value2"), None)
  key2Value <- Seq(Some("value1"), Some("value2"), None)
  key3Value <- Seq(Some("value1"), Some("value2"), None)
  key4Value <- Seq(Some("value1"), Some("value2"), None)
} yield Seq(
  key1Value.map("key1" -> _),
  key2Value.map("key2" -> _),
  key3Value.map("key3" -> _),
  key4Value.map("key4" -> _)
).flatten.toMap
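For reference, this comprehension yields every combination of present/absent values for the four keys, 3^4 = 81 maps in total; keys whose value is None are simply left out of the document:

// A few of the 81 generated maps:
// Map("key1" -> "value1", "key2" -> "value1", "key3" -> "value1", "key4" -> "value1")
// Map("key1" -> "value2", "key3" -> "value1")
// Map()   (all four values were None, so the document has no fields at all)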
Most queries work exactly as expected, e.g.:
1.
val expectedDocs = testData
  .filter(doc =>
    (doc.get("key1").contains("value2") || doc.get("key2").contains("value2")) &&
      (doc.get("key3").contains("value2") || doc.get("key4").contains("value2"))
  )

LuceneTestUtils.testDocuments(testData, "(key1:value2 OR key2:value2) AND (key3:value2 OR key4:value2)", expectedDocs)
The following test cases show very unexpected behaviour:
2.
val expectedDocs = testData
  .filter(doc =>
    doc.get("key1").contains("value1") &&
      // This feels very wrong, it should be ||
      doc.get("key2").contains("value1") &&
      doc.get("key3").contains("value1") &&
      doc.get("key4").contains("value1")
  )

LuceneTestUtils.testDocuments(testData, "key1:value1 AND key2:value1 OR key3:value1 AND key4:value1", expectedDocs)
3.
val expectedDocs = testData
  .filter(doc =>
    // ???
    // doc.get("key1").contains("value2") || (
    doc.get("key2").contains("value2") &&
      doc.get("key3").contains("value2")
    // ) || doc.get("key4").contains("value2")
  )

LuceneTestUtils.testDocuments(testData, "key1:value2 OR key2:value2 AND key3:value2 OR key4:value2", expectedDocs)
No choice of operator precedence explains the unparenthesized behaviour:
- AND binding tighter than OR
- AND and OR at equal precedence, with either left-to-right or right-to-left associativity
- OR binding tighter than AND
A hint at the problem is the output of that println(q) for the three queries above:
1. +(key1:value2 key2:value2) +(key3:value2 key4:value2)
2. +key1:value1 +key2:value1 +key3:value1 +key4:value1
3. key1:value2 +key2:value2 +key3:value2 key4:value2
These match the observed behaviour, but don't make sense. Why does Lucene work in that way?
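For completeness, inspecting the clauses of the parsed query directly shows the same flags; this is a sketch that assumes the top-level Query returned by the parser is a BooleanQuery:

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.search.BooleanQuery
import scala.jdk.CollectionConverters._

val parsed = new QueryParser("defaultField", new StandardAnalyzer())
  .parse("key1:value2 OR key2:value2 AND key3:value2 OR key4:value2")

// Print each clause together with its Occur flag
parsed.asInstanceOf[BooleanQuery].clauses().asScala.foreach { clause =>
  println(s"${clause.getOccur} ${clause.getQuery}")
}
// SHOULD key1:value2
// MUST   key2:value2
// MUST   key3:value2
// SHOULD key4:value2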
AND and OR were added to Lucene as a way to try to be more intuitive than + (meaning "must exist"), alongside "may exist" and "must not exist". These are the fundamental Lucene operators here. We have to remember that Lucene's overall objective is not a pure boolean hit-or-miss process, but rather a scoring process (more relevant vs. less relevant matches, ranked in order).
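To make that concrete, here is a minimal sketch (against lucene-core 9.x) that builds your third parsed query by hand out of those clause-level operators. The parser ultimately produces a flat list of clauses, each flagged MUST (printed as +, "must exist"), SHOULD (no prefix, "may exist") or MUST_NOT (-, "must not exist"):

import org.apache.lucene.index.Term
import org.apache.lucene.search.{BooleanClause, BooleanQuery, TermQuery}

val byHand = new BooleanQuery.Builder()
  .add(new TermQuery(new Term("key1", "value2")), BooleanClause.Occur.SHOULD)
  .add(new TermQuery(new Term("key2", "value2")), BooleanClause.Occur.MUST)
  .add(new TermQuery(new Term("key3", "value2")), BooleanClause.Occur.MUST)
  .add(new TermQuery(new Term("key4", "value2")), BooleanClause.Occur.SHOULD)
  .build()

println(byHand) // key1:value2 +key2:value2 +key3:value2 key4:value2

There is no nested AND/OR tree in this model unless you introduce one yourself with parentheses, which is why the unparenthesized queries flatten into the clause lists you printed.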