A group of amusing students write essays exclusively by plagiarising portions of the complete works of WIlliam Shakespere. At one end of the scale, an essay might exclusively consist a verbatim copy of a soliloquy... at the other, one might see work so novel that - while using a common alphabet - no two adjacent characters in the essay were used adjacently by Will.
Essays need to be graded. A score of 1 is assigned to any essay which can be found (character-by-character identical) in the plain-text of the complete works. A score of 2 is assigned to any work that can be successfully constructed from no fewer than two distinct (character-by-character identical) passages in the complete works, and so on... up to the limit - for an essay with N characters - which scores N if, and only if, no two adjacent characters in the essay were also placed adjacently in the complete works.
The challenge is to implement a program which can efficiently (and accurately) score essays. While any (practicable) data-structure to represent the complete works is acceptable - the essays are presented as ASCII strings.
Having considered this teasing question for a while, I came to the conclusion that it is much harder than it sounds. The naive solution, for an essay of length N, involves 2**(N-1) traversals of the complete works - which is far too inefficient to be practical.
While, obviously, I'm interested in suggested solutions - I'd also appreciate pointers to any literature that deals with this, or any similar, problem.
CLARIFICATIONS
Perhaps some examples (ranging over much shorter strings) will help clarify the 'score' for 'essays'?
Assume Shakespere's complete works are abridged to:
"The quick brown fox jumps over the lazy dog."
Essays scoring 1 include "own fox jump" and "The quick brow". The essay "jogging" scores 6 (despite being short) because it can't be represented in fewer than 6 segments of the complete works... It can be segmented into six strings that are all substrings of the complete works as follows: "[j][og][g][i][n][g]". N.B. Establishing scores for this short example is trivial compared to the original problem - because, in this example "complete works" - there is very little repetition.
Hopefully, this example segmentation helps clarify the 2*(N-1) substring searches in the complete works. If we consider the segmentation, the (N-1) gaps between the N characters in the essay may either be a gap between segments, or not... resulting in ~ 2*(N-1) substring searches of the complete works to test each segmentation hypothesis.
An (N)DFA would be a wonderful solution - if it were practical. I can see how to construct something that solved 'substring matching' in this way - but not scoring. The state space for scoring, on the surface, at least, seems wildly too large (for any substantial complete works of Shakespere.) I'd welcome any explanation that undermines my assumptions that the (N)DFA would be too large to be practical to compute/store.