I was using Python's difflib to create comprehensive differential logs between rather long files. Everything was running smoothly until I hit the problem of never-ending diffs. After some digging, it turned out that difflib cannot handle long sequences of semi-matching lines.
Here is a (somewhat minimal) example:
import sys
import random
import difflib

def make_file(fname, dlines):
    with open(fname, 'w') as f:
        f.write("This is a small file with a long sequence of different lines\n")
        f.write("Some of the starting lines could differ {}\n".format(random.random()))
        f.write("...\n")
        f.write("...\n")
        f.write("...\n")
        f.write("...\n")
        for i in range(dlines):
            f.write("{}\t{}\t{}\t{}\n".format(i, i+random.random()/100, i+random.random()/10000, i+random.random()/1000000))

make_file("a.txt", 125)
make_file("b.txt", 125)

with open("a.txt") as ff:
    fromlines = ff.readlines()
with open("b.txt") as tf:
    tolines = tf.readlines()

diff = difflib.ndiff(fromlines, tolines)
sys.stdout.writelines(diff)
Even for the 125 lines in the example, Python took over 4 seconds to compute and print the diff, while GNU Diff finished in a few milliseconds. And the files I actually need to compare are roughly 100 times longer.
Is there a sensible solution to this issue? I had hoped to use difflib, as it produces rather nice HTML diffs, but I am open to suggestions. I need a portable solution that works on as many platforms as possible, although I am already considering porting GNU Diff for the purpose :). Hacking into difflib is also an option, as long as I don't have to literally rewrite the whole library.
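One thing I noticed while experimenting: the slow part appears to be ndiff's intra-line ("fancy") matching, which compares semi-similar lines character by character. Sticking to difflib but using unified_diff, which only compares whole lines, stays fast on the same kind of input. A minimal sketch (make_lines below is just a stand-in for the file generator above, not code from my real program):

```python
import difflib
import random

random.seed(0)  # deterministic for the demo

def make_lines(n):
    # stand-in for the files above: many lines that are
    # similar in shape but never byte-identical
    return ["{}\t{:.6f}\n".format(i, i + random.random() / 100)
            for i in range(n)]

fromlines = make_lines(125)
tolines = make_lines(125)

# unified_diff works on whole lines only and skips ndiff's
# expensive per-character matching of "close" lines
diff = list(difflib.unified_diff(fromlines, tolines, "a.txt", "b.txt"))
```

This loses the intra-line markers (the "?" lines) that ndiff produces, which is presumably why it avoids the blow-up.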
PS. The files might have variable-length prefixes, so splitting them into parts without aligning diff context might not be the best idea.
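For reference, the fallback I'm considering instead of actually porting GNU Diff is just shelling out to whatever diff binary is installed. This assumes a POSIX-style diff on PATH (so it is not truly portable, e.g. on stock Windows), and the function name external_diff is only for illustration:

```python
import subprocess

def external_diff(a, b):
    # assumes a POSIX-style `diff` binary on PATH;
    # diff exits 0 when the files match, 1 when they differ,
    # and >1 on real errors
    result = subprocess.run(["diff", "-u", a, b],
                            capture_output=True, text=True)
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout
```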