Using Lucene, certain queries parse and execute in a completely unexpected way.
Here's the code for testing it (written in Scala, but can be easily translated to Java too):
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document._
import org.apache.lucene.index._
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.search.{IndexSearcher, Query}
import org.apache.lucene.store._
import scala.jdk.CollectionConverters._
object LuceneTestUtils {

  def testDocuments(
      docs: Seq[Map[String, String]],
      query: String,
      expected: Seq[Map[String, String]]
  ): Unit = {
    withIndex(docs) { searcher =>
      val analyzer = new StandardAnalyzer()
      val parser = new QueryParser("defaultField", analyzer)
      val q: Query = parser.parse(query)
      println(q)

      val hits = searcher.search(q, 1000).scoreDocs
      val results = hits.map { hit =>
        val doc = searcher.doc(hit.doc)
        doc.getFields.asScala.map(f => f.name() -> doc.get(f.name())).toMap
      }.toList

      val notExpected = results.diff(expected)
      assert(
        notExpected.isEmpty,
        s"Got unexpected documents:\n${notExpected.mkString("\n")}\nReceived documents:\n${results.mkString("\n")}"
      )
      val missing = expected.diff(results)
      assert(
        missing.isEmpty,
        s"Missing expected documents:\n${missing.mkString("\n")}\nReceived documents:\n${results.mkString("\n")}"
      )
    }
  }

  def withIndex(
      docs: Seq[Map[String, String]]
  )(test: IndexSearcher => Unit): Unit = {
    val analyzer = new StandardAnalyzer()
    // In-memory index, built fresh for each call
    val index: Directory = new ByteBuffersDirectory(NoLockFactory.INSTANCE)
    val config = new IndexWriterConfig(analyzer)
    val writer = new IndexWriter(index, config)

    docs.foreach { fields =>
      val doc = new Document()
      fields.foreach { case (k, v) =>
        // StringField: stored, indexed as a single untokenized term
        doc.add(new StringField(k, v, Field.Store.YES))
      }
      writer.addDocument(doc)
    }
    writer.close()

    val reader = DirectoryReader.open(index)
    val searcher = new IndexSearcher(reader)
    try {
      test(searcher)
    } finally {
      reader.close()
      index.close()
    }
  }
}
Dependencies used:
"org.apache.lucene" % "lucene-core" % "9.9.2",
"org.apache.lucene" % "lucene-queryparser" % "9.9.2",
I generate some test data:
private val testData = for {
  key1Value <- Seq(Some("value1"), Some("value2"), None)
  key2Value <- Seq(Some("value1"), Some("value2"), None)
  key3Value <- Seq(Some("value1"), Some("value2"), None)
  key4Value <- Seq(Some("value1"), Some("value2"), None)
} yield Seq(
  key1Value.map("key1" -> _),
  key2Value.map("key2" -> _),
  key3Value.map("key3" -> _),
  key4Value.map("key4" -> _)
).flatten.toMap
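For reference, this comprehension yields every combination of present/absent values for the four keys, 3^4 = 81 maps in total; keys whose value is None are simply left out of the document:

// A few of the 81 generated maps:
// Map("key1" -> "value1", "key2" -> "value1", "key3" -> "value1", "key4" -> "value1")
// Map("key1" -> "value2", "key3" -> "value1")
// Map()   (all four values were None, so the document has no fields at all)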
Most queries work exactly as expected, e.g.:
1.
val expectedDocs = testData
  .filter(doc =>
    (doc.get("key1").contains("value2") || doc.get("key2").contains("value2")) &&
      (doc.get("key3").contains("value2") || doc.get("key4").contains("value2"))
  )

LuceneTestUtils.testDocuments(testData, "(key1:value2 OR key2:value2) AND (key3:value2 OR key4:value2)", expectedDocs)
The following test cases show very unexpected behaviour:
2.
val expectedDocs = testData
  .filter(doc =>
    doc.get("key1").contains("value1") &&
      // This feels very wrong, it should be ||
      doc.get("key2").contains("value1") &&
      doc.get("key3").contains("value1") &&
      doc.get("key4").contains("value1")
  )

LuceneTestUtils.testDocuments(testData, "key1:value1 AND key2:value1 OR key3:value1 AND key4:value1", expectedDocs)
3.
val expectedDocs = testData
  .filter(doc =>
    // ???
    // doc.get("key1").contains("value2") || (
    doc.get("key2").contains("value2") &&
      doc.get("key3").contains("value2")
    // ) || doc.get("key4").contains("value2")
  )

LuceneTestUtils.testDocuments(testData, "key1:value2 OR key2:value2 AND key3:value2 OR key4:value2", expectedDocs)
No choice of operator precedence explains the unparenthesized behaviour:
- AND binding tighter than OR
- AND and OR at equal precedence, with either left-to-right or right-to-left associativity
- OR binding tighter than AND
A hint at the problem is the output of that println(q) for the three queries above:
1. +(key1:value2 key2:value2) +(key3:value2 key4:value2)
2. +key1:value1 +key2:value1 +key3:value1 +key4:value1
3. key1:value2 +key2:value2 +key3:value2 key4:value2
These match the observed behaviour, but don't make sense. Why does Lucene work in that way?
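For completeness, inspecting the clauses of the parsed query directly shows the same flags; this is a sketch that assumes the top-level Query returned by the parser is a BooleanQuery:

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.search.BooleanQuery
import scala.jdk.CollectionConverters._

val parsed = new QueryParser("defaultField", new StandardAnalyzer())
  .parse("key1:value2 OR key2:value2 AND key3:value2 OR key4:value2")

// Print each clause together with its Occur flag
parsed.asInstanceOf[BooleanQuery].clauses().asScala.foreach { clause =>
  println(s"${clause.getOccur} ${clause.getQuery}")
}
// SHOULD key1:value2
// MUST   key2:value2
// MUST   key3:value2
// SHOULD key4:value2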
AND and OR were added to Lucene as a way to try to be more intuitive than + (meaning "must exist"), alongside "may exist" and "must not exist". These are the fundamental Lucene operators here. We have to remember that Lucene's overall objective is not a pure boolean hit-or-miss process, but rather a scoring process (more relevant vs. less relevant matches, ranked in order).
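To make that concrete, here is a minimal sketch (against lucene-core 9.x) that builds your third parsed query by hand out of those clause-level operators. The parser ultimately produces a flat list of clauses, each flagged MUST (printed as +, "must exist"), SHOULD (no prefix, "may exist") or MUST_NOT (-, "must not exist"):

import org.apache.lucene.index.Term
import org.apache.lucene.search.{BooleanClause, BooleanQuery, TermQuery}

val byHand = new BooleanQuery.Builder()
  .add(new TermQuery(new Term("key1", "value2")), BooleanClause.Occur.SHOULD)
  .add(new TermQuery(new Term("key2", "value2")), BooleanClause.Occur.MUST)
  .add(new TermQuery(new Term("key3", "value2")), BooleanClause.Occur.MUST)
  .add(new TermQuery(new Term("key4", "value2")), BooleanClause.Occur.SHOULD)
  .build()

println(byHand) // key1:value2 +key2:value2 +key3:value2 key4:value2

There is no nested AND/OR tree in this model unless you introduce one yourself with parentheses, which is why the unparenthesized queries flatten into the clause lists you printed.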