
A group of amusing students write essays exclusively by plagiarising portions of the complete works of William Shakespeare. At one end of the scale, an essay might consist exclusively of a verbatim copy of a soliloquy... at the other, one might see work so novel that - while using a common alphabet - no two adjacent characters in the essay were used adjacently by Will.

Essays need to be graded. A score of 1 is assigned to any essay which can be found (character-by-character identical) in the plain text of the complete works. A score of 2 is assigned to any work that can be constructed from two distinct (character-by-character identical) passages in the complete works, but not from fewer, and so on... up to the limit - for an essay with N characters - which scores N if, and only if, no two adjacent characters in the essay were also placed adjacently in the complete works.

The challenge is to implement a program which can efficiently (and accurately) score essays. While any (practicable) data-structure to represent the complete works is acceptable - the essays are presented as ASCII strings.

Having considered this teasing question for a while, I came to the conclusion that it is much harder than it sounds. The naive solution, for an essay of length N, involves 2**(N-1) traversals of the complete works - which is far too inefficient to be practical.

While, obviously, I'm interested in suggested solutions - I'd also appreciate pointers to any literature that deals with this, or any similar, problem.

CLARIFICATIONS

Perhaps some examples (ranging over much shorter strings) will help clarify the 'score' for 'essays'?

Assume Shakespeare's complete works are abridged to:

"The quick brown fox jumps over the lazy dog."

Essays scoring 1 include "own fox jump" and "The quick brow". The essay "jogging" scores 6 (despite being short) because it can't be represented in fewer than 6 segments of the complete works... It can be segmented into six strings that are all substrings of the complete works as follows: "[j][og][g][i][n][g]". N.B. Establishing scores for this short example is trivial compared to the original problem, because in this example "complete works" there is very little repetition.

Hopefully, this example segmentation helps clarify the 2**(N-1) substring searches in the complete works. If we consider the segmentation, each of the (N-1) gaps between the N characters in the essay may either be a gap between segments, or not... resulting in ~ 2**(N-1) segmentation hypotheses, each of which requires substring searches of the complete works to test.
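
To make that count concrete, here is a purely illustrative brute-force scorer (a hypothetical helper, not part of the question) that enumerates every break/no-break choice for the N-1 gaps; it is only usable for tiny essays, but it shows exactly where the 2**(N-1) hypotheses come from.

```python
from itertools import product

def brute_force_score(essay: str, works: str):
    """Try every placement of breaks in the N-1 gaps between characters,
    keep segmentations whose pieces are all substrings of `works`, and
    return the fewest pieces (or None if no valid segmentation exists)."""
    n = len(essay)
    best = None
    for breaks in product([False, True], repeat=n - 1):   # 2**(N-1) hypotheses
        cuts = [0] + [i + 1 for i, b in enumerate(breaks) if b] + [n]
        pieces = [essay[a:b] for a, b in zip(cuts, cuts[1:])]
        if all(p in works for p in pieces):
            best = len(pieces) if best is None else min(best, len(pieces))
    return best
```

With the abridged "complete works" above, brute_force_score("jogging", works) returns 6, matching the segmentation shown.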

An (N)DFA would be a wonderful solution - if it were practical. I can see how to construct something that solves 'substring matching' in this way - but not scoring. The state space for scoring, on the surface at least, seems wildly too large (for any substantial complete works of Shakespeare.) I'd welcome any explanation that undermines my assumption that the (N)DFA would be too large to be practical to compute/store.

  • Where does this number, 2**(N-1), come from? To me it looks much worse, like N!, because we have to consider permutations here, don't we? Commented Feb 7, 2014 at 4:31
  • I've added clarifications - hopefully this helps? Commented Feb 9, 2014 at 16:06
  • You want to find minimal number of parts to which string A could be split, so that every part is substring of string B? Commented Feb 9, 2014 at 17:08
  • Yes, where A is the "essay" and B is the "complete works". B is constant. I need to (quickly) compute this minimal number for many values of A. Commented Feb 9, 2014 at 17:11

4 Answers


A general approach for plagiarism detection is to append the student's text to the source text separated by a character not occurring in either and then to build either a suffix tree or suffix array. This will allow you to find in linear time large substrings of the student's text which also appear in the source text.
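
As a rough illustration of that construction (a sketch under stated assumptions, not this answer's actual code): append the student's text to the source text with a separator character occurring in neither, sort the suffixes, and adjacent suffixes coming from different texts expose the shared substrings. The deliberately naive O(n² log n) suffix sort and linear LCP scan below are only for readability; a linear-time suffix array or suffix tree would be used in practice.

```python
def longest_shared_substring(source: str, student: str) -> str:
    """Report the longest substring of `student` that also occurs in `source`,
    via a (naively built) suffix array of source + separator + student."""
    sep = "\x00"                      # assumed to occur in neither text
    text = source + sep + student
    cut = len(source)                 # suffixes starting after `cut` belong to the student text
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for a, b in zip(sa, sa[1:]):
        if (a < cut) == (b < cut):    # adjacent suffixes from the same text tell us nothing
            continue
        lcp = 0                       # longest common prefix of the two adjacent suffixes
        while (a + lcp < len(text) and b + lcp < len(text)
               and text[a + lcp] == text[b + lcp]):
            lcp += 1
        if lcp > len(best):
            best = text[a:a + lcp]
    return best
```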

I find it difficult to be more specific because I do not understand your explanation of the score - the method above would be good for finding the longest stretch in the student's work which is an exact quote, but I don't understand your N - is it the number of distinct sections of source text needed to construct the student's text?

If so, there may be a dynamic programming approach. At step k, we work out the least number of distinct sections of source text needed to construct the first k characters of the student's text. Using a suffix array built just from the source text or otherwise, we find the longest match between the source text and characters x..k of the student's text, where x is of course as small as possible. Then the least number of sections of source text needed to construct the first k characters of student text is the least needed to construct 1..x-1 (which we have already worked out) plus 1. By running this process for k=1..the length of the student text we find the least number of sections of source text needed to reconstruct the whole of it.
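
A minimal sketch of the left-to-right process this boils down to (see the greedy discussion below), assuming every essay character occurs somewhere in the source. For readability it replaces the suffix-array longest-match step with a binary search over Python's substring test - valid because every prefix of a substring is itself a substring, so the test is monotone in the segment length - which is correct but not linear-time; a suffix array/tree would replace the inner probe in a real implementation.

```python
def score(essay: str, works: str) -> int:
    """Minimum number of substrings of `works` needed to tile `essay`,
    taking the longest possible match at each position."""
    i, segments = 0, 0
    n = len(essay)
    while i < n:
        if essay[i] not in works:
            raise ValueError(f"{essay[i]!r} never occurs in the works")
        # Binary search for the longest essay[i:i+length] occurring in `works`.
        lo, hi = 1, n - i          # essay[i:i+1] is already known to occur
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if essay[i:i + mid] in works:
                lo = mid
            else:
                hi = mid - 1
        i += lo
        segments += 1
    return segments
```

On the clarification example, score("jogging", "The quick brown fox jumps over the lazy dog.") returns 6.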

(Or you could just search StackOverflow for the student's text, on the grounds that students never do anything these days except post their question on StackOverflow :-)).

I claim that repeatedly moving along the target string from left to right, using a suffix array or tree to find the longest match at any time, will find the smallest number of different strings from the source text that produces the target string. I originally found this by looking for a dynamic programming recursion but, as pointed out by Evgeny Kluev, this is actually a greedy algorithm, so let's try and prove this with a typical greedy algorithm proof.

Suppose not. Then there is a solution better than the one you get by going for the longest match every time you run off the end of the current match. Compare the two proposed solutions from left to right and look for the first time when the non-greedy solution differs from the greedy solution. If there are multiple non-greedy solutions that do better than the greedy solution I am going to demand that we consider the one that differs from the greedy solution at the last possible instant.

If the non-greedy solution is going to do better than the greedy solution, and there isn't a non-greedy solution that does better and differs later, then the non-greedy solution must find that, in return for breaking off its first match earlier than the greedy solution, it can carry on its next match for longer than the greedy solution. If it can't, it might somehow do better than the greedy solution, but not in this section, which means there is a better non-greedy solution which sticks with the greedy solution until the end of our non-greedy solution's second matching section, which is against our requirement that we want the non-greedy better solution that sticks with the greedy one as long as possible.

So we have to assume that, in return for breaking off the first match early, the non-greedy solution gets to carry on its second match longer. But this doesn't work, because, when the greedy solution finally has to finish using its first match, it can jump on to the same section of matching text that the non-greedy solution is using, just entering that section later than the non-greedy solution did, but carrying on for at least as long as the non-greedy solution. So there is no non-greedy solution that does better than the greedy solution and the greedy solution is optimal.


4 Comments

This approach doesn't seem to calculate the scores I need... which, I suspect, are substantially more complex than what can be computed using your approach. I've added a clarification of the 'score'... The interesting scores will be relatively large (in the range 1,000 to 1,000,000 - say.) BTW, the students and the plagiarism should be interpreted as a 'semi-plausible' background fiction to introduce the more abstract challenge.
@aSteve: IMHO this calculates scores exactly as you need and in linear time (which is optimal). The only unclear part is why it is named "dynamic programming approach". I think it would be better to name it "greedy" algorithm.
@EvgenyKluev yes you are right - in looking for a dynamic programming recursion I stumbled upon a greedy algorithm. I have added a section to my answer acknowledging this and attempting to justify it with a typical greedy algorithm justification.
On reflection, while it wasn't initially intuitive, a greedy approach with suffix trees/arrays appears to be exactly what is needed. I had previously assumed a bogus justification for excluding a greedy approach. "ACCEPT!" :)

Have you considered using N-Grams to solve this problem?

http://en.wikipedia.org/wiki/N-gram
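
One small, concrete connection between N-grams and the score as defined in the question (a hedged illustration, not a full solution): the set of 2-grams of the complete works immediately decides the extreme case, because an essay of length N scores exactly N precisely when none of its adjacent character pairs ever appears adjacently in the works (assuming each individual character does occur). Turning 2-gram/3-gram data into the intermediate scores is the hard part, as the comments below note.

```python
def scores_maximally(essay: str, works: str) -> bool:
    """True iff the essay scores N (its own length): no adjacent pair of
    essay characters appears adjacently anywhere in the complete works.
    Assumes every individual essay character occurs in the works."""
    bigrams = {works[i:i + 2] for i in range(len(works) - 1)}
    return all(essay[i:i + 2] not in bigrams for i in range(len(essay) - 1))
```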

2 Comments

Sort-of... :) N-Grams look highly relevant... though (so far) I've not spotted an obvious efficient solution arising from them. One of my concerns is that - if I pre-compute/store all n-grams (of every length) of the complete-works, this could prove very space inefficient. Storing all 2-grams and 3-grams would be trivial - for example - but it is not obvious how to efficiently 'stitch-together' this local proximity data to accurately calculate a "score".
That's pretty much the thoughts I had... :)

First read the complete works of Shakespeare and build a trie. Then process the string left to right. We can greedily take the longest substring that matches one in the data because we want the minimum number of strings, so there is no factor of 2^N. The second part is dirt cheap O(N).

The depth of the trie is limited by the available space. With a gigabyte of RAM you could reasonably expect to exhaustively cover Shakespearean English strings of length at least 5 or 6. I would require that the leaf nodes are unique (which also gives a rule for constructing the trie) and keep a pointer to their place in the actual works, so you have access to the continuation.
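
A minimal sketch of this trie plus greedy walk, assuming a plain dict-of-dicts trie capped at a chosen depth. For brevity it omits the leaf pointers back into the actual works that the answer calls for, so a match is simply cut off at max_depth instead of being continued; the shape of the walk is the point here.

```python
def build_trie(works: str, max_depth: int) -> dict:
    """Insert every substring of `works` of length up to `max_depth`
    by walking each suffix at most `max_depth` characters deep."""
    root: dict = {}
    for start in range(len(works)):
        node = root
        for ch in works[start:start + max_depth]:
            node = node.setdefault(ch, {})
    return root

def greedy_trie_score(essay: str, trie: dict) -> int:
    """Greedily take the longest trie match at each position and count segments."""
    i, segments = 0, 0
    while i < len(essay):
        node, length = trie, 0
        while i + length < len(essay) and essay[i + length] in node:
            node = node[essay[i + length]]
            length += 1
        if length == 0:
            raise ValueError(f"{essay[i]!r} does not occur in the works")
        i += length
        segments += 1
    return segments
```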

4 Comments

You're right that greediness will work, but I think it needs a bit more justification than just "because we want the minimum number of strings" :-P Suppose we have a greedy solution G, and another (possibly the same) solution S in which the first disagreement with G happens at the ith block: in G this block spans positions [j, k] in the complete work, while in S it spans [j', k'] with k'-j' <= k-j. Now just change S by replacing [j', k'] with [j, k], deleting any subsequent blocks that are now wholly contained within [j, k], and if there is a block [x, y] that partially overlaps [j, k]...
... so that the first character in it not covered by [j, k] starts at z > x in the complete work, shrinking it from [x, y] to [z, y]. This operation can always be performed, and only ever leaves the total number of blocks the same or less than before, so repeating it until there is no leftmost disagreeing block will transform S into G at the same or lower cost. This applies even if S is optimal, so G must be optimal too.
Thanks @j_random_hacker, I wanted to include a proof but didn't find one concise enough for my tastes.
You're welcome! Also it looks like mcdowella came up with much the same proof in his/her answer.

This feels like a problem of partial matching a very large regular expression.

If so, it can be solved by a very large non-deterministic finite state automaton - or, put more broadly, by a graph representing, for every character in the works of Shakespeare, all the possible next characters.

If necessary, for efficiency reasons, the NDFA is guaranteed to be convertible to a DFA. But that construction can give rise to 2^n states - maybe this is what you were alluding to?

This aspect of the complexity does not really worry me. The NDFA will have M + C states: one state for each character of the works, plus C start states (where C = 26*2 + #punctuation) connecting to the M character states so that the algorithm can (re)start when there are 0 matched characters. The question is whether the corresponding DFA would have O(2^M) states and, if so, whether it is necessary to build that DFA - theoretically it is not. Consider that in this construction each character state has one and only one transition, to exactly one other state (the state corresponding to the next character in that work). We would expect each start state to be connected to M/C states on average, and to M states in the worst case, meaning the NDFA will have to track at most M simultaneous states. That's a large number, but not an impossibly large one for computers these days.

The score would be derived by initializing it to 1 and then incrementing it every time a non-accepting state is reached.
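
One way to make this concrete without determinizing anything is to simulate the set of active positions in the works directly - the "at most M simultaneous states" above - and bump the score whenever the set dies and the match must restart. A sketch, assuming every essay character occurs somewhere in the works; the names are illustrative, not from the answer.

```python
from collections import defaultdict

def ndfa_score(essay: str, works: str) -> int:
    """Track every index j in `works` at which the current segment could
    currently be ending; restart (and increment the score) when no index
    can be extended by the next essay character."""
    occurrences = defaultdict(list)           # character -> positions in the works
    for j, ch in enumerate(works):
        occurrences[ch].append(j)

    score, active = 0, set()
    for ch in essay:
        advanced = {j + 1 for j in active
                    if j + 1 < len(works) and works[j + 1] == ch}
        if advanced:
            active = advanced
        else:
            if not occurrences[ch]:
                raise ValueError(f"{ch!r} never occurs in the works")
            active = set(occurrences[ch])     # new segment starts at every occurrence of ch
            score += 1
    return score
```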

It's true that one of the approaches to string searching is building a DFA. In fact, for the majority of string search algorithms, it looks like a small modification - incrementing a counter on failure to match, and carrying on as normal on success - could serve as a general strategy.

2 Comments

I'm broadly familiar with the literature on string searching - where (N)DFAs are superb solutions. However, I can't envision any viable strategy to construct a scoring (N)DFA from the complete works of Shakespeare (which are long)... I can only envision an automaton with order 2**(M-1) states, where M is the number of characters in the complete works... i.e. a number of states so large that it does not suggest a practicable solution.
@aSteve I've updated answer with a comment on the bound of the complexity.
