I was using Python's difflib to create comprehensive differential logs between rather long files. Everything was running smoothly until I hit the problem of never-ending diffs. After some digging, it turned out that difflib cannot handle long sequences of semi-matching lines.
Here is a (somewhat minimal) example:
import sys
import random
import difflib

def make_file(fname, dlines):
    with open(fname, 'w') as f:
        f.write("This is a small file with a long sequence of different lines\n")
        f.write("Some of the starting lines could differ {}\n".format(random.random()))
        f.write("...\n")
        f.write("...\n")
        f.write("...\n")
        f.write("...\n")
        for i in range(dlines):
            f.write("{}\t{}\t{}\t{}\n".format(i, i+random.random()/100, i+random.random()/10000, i+random.random()/1000000))

make_file("a.txt", 125)
make_file("b.txt", 125)

with open("a.txt") as ff:
    fromlines = ff.readlines()
with open("b.txt") as tf:
    tolines = tf.readlines()

diff = difflib.ndiff(fromlines, tolines)
sys.stdout.writelines(diff)
Even for the 125 lines in the example, Python took over 4 seconds to compute and print the diff, while GNU Diff finished in a few milliseconds. And the files I actually need to compare are roughly 100 times longer.
Is there a sensible solution to this issue? I had hoped to use difflib, as it produces rather nice HTML diffs, but I am open to suggestions. I need a portable solution that works on as many platforms as possible, although I am already considering porting GNU Diff for the purpose :). Hacking into difflib is also an option, as long as I don't have to literally rewrite the whole library.
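One thing I noticed while experimenting: the slow part appears to be ndiff's intra-line ("fancy") matching, which compares semi-similar lines character by character. Sticking to difflib but using unified_diff, which only compares whole lines, stays fast on the same kind of input. A minimal sketch (make_lines below is just a stand-in for the file generator above, not code from my real program):

```python
import difflib
import random

random.seed(0)  # deterministic for the demo

def make_lines(n):
    # stand-in for the files above: many lines that are
    # similar in shape but never byte-identical
    return ["{}\t{:.6f}\n".format(i, i + random.random() / 100)
            for i in range(n)]

fromlines = make_lines(125)
tolines = make_lines(125)

# unified_diff works on whole lines only and skips ndiff's
# expensive per-character matching of "close" lines
diff = list(difflib.unified_diff(fromlines, tolines, "a.txt", "b.txt"))
```

This loses the intra-line markers (the "?" lines) that ndiff produces, which is presumably why it avoids the blow-up.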
PS. The files might have variable-length prefixes, so splitting them into parts without aligning diff context might not be the best idea.
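For reference, the fallback I'm considering instead of actually porting GNU Diff is just shelling out to whatever diff binary is installed. This assumes a POSIX-style diff on PATH (so it is not truly portable, e.g. on stock Windows), and the function name external_diff is only for illustration:

```python
import subprocess

def external_diff(a, b):
    # assumes a POSIX-style `diff` binary on PATH;
    # diff exits 0 when the files match, 1 when they differ,
    # and >1 on real errors
    result = subprocess.run(["diff", "-u", a, b],
                            capture_output=True, text=True)
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout
```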