
I have two big text files, nearly 2 GB each. I need something like diff f1.txt f2.txt. Is there a fast way to do this in Python? The standard difflib is too slow. I assume a faster way exists, because difflib is implemented entirely in Python.

  • Why not use diff f1.txt f2.txt? Commented Feb 4, 2011 at 14:34
  • @delnan: because it would make my script platform dependent. Getting the diff of the files is only one part of the script. Commented Feb 4, 2011 at 14:38
  • Is it feasible to try it with Psyco acceleration, or with an Unladen Swallow or PyPy build? Commented Feb 4, 2011 at 15:02
  • For a little reference, can you tell us how long difflib takes to compare the files on your computer and what kind of speedup you would like to see? Commented Feb 4, 2011 at 15:03
  • @chmullig I need to get two lists (or files). The first list must contain the strings that were added in the second file, and the second list must contain the strings that were removed in the second file. Commented Feb 4, 2011 at 16:21

1 Answer


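How about using difflib in a way that lets your script handle big files? Don't load the files into memory; iterate through the lines of both files and diff them in chunks, e.g. 100 lines at a time. Note that chunking compares the files position by position, so it works best when the two files stay roughly line-aligned.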

import difflib
import sys
from itertools import islice

CHUNK = 100  # number of lines compared per batch

d = difflib.Differ()

with open('bigfile1') as f1, open('bigfile2') as f2:
    while True:
        # Read the next batch from each file without loading the whole file.
        b1 = list(islice(f1, CHUNK))
        b2 = list(islice(f2, CHUNK))
        if not b1 and not b2:
            break  # both files are exhausted
        # Differ.compare() expects sequences of lines, not joined strings.
        sys.stdout.writelines(d.compare(b1, b2))
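
The asker's comment asks for two lists: lines added in the second file and lines removed from it. A minimal sketch of how the chunked approach could produce them, assuming the files stay roughly line-aligned (the function name added_and_removed and its parameters are hypothetical, chosen for illustration):

```python
import difflib
from itertools import islice


def added_and_removed(path_old, path_new, chunk=100):
    """Return (added, removed) line lists from a chunk-by-chunk comparison."""
    differ = difflib.Differ()
    added, removed = [], []
    with open(path_old) as f_old, open(path_new) as f_new:
        while True:
            old_lines = list(islice(f_old, chunk))
            new_lines = list(islice(f_new, chunk))
            if not old_lines and not new_lines:
                break  # both files are exhausted
            for line in differ.compare(old_lines, new_lines):
                # Differ prefixes: '+ ' = only in the second sequence,
                # '- ' = only in the first, '  ' = common,
                # '? ' = intraline hints (ignored here).
                if line.startswith('+ '):
                    added.append(line[2:])
                elif line.startswith('- '):
                    removed.append(line[2:])
    return added, removed
```

Because each chunk is diffed independently, a line that merely shifts across a chunk boundary is reported as both removed and added. For 2 GB inputs with many differences, the two lists may be better written to files than kept in memory.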

2 Comments

Your other fast and portable option would be to ask users to install a diff utility for their platform and then call it through a Python wrapper.
Python's difflib is just slow no matter what you do. Two almost identical files of 1 MB each take me 0.5 s in the best case and a few minutes in the worst case. A binary diff takes 0.033 s.
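
The wrapper idea from the first comment could look roughly like the sketch below. It assumes a diff executable is on the PATH and follows the POSIX exit-status convention (0 = identical, 1 = different, greater than 1 = error); the function name external_diff and the tool parameter are hypothetical.

```python
import shutil
import subprocess


def external_diff(path_old, path_new, tool="diff"):
    """Run an external diff tool; return its output, or None if it is missing."""
    exe = shutil.which(tool)
    if exe is None:
        return None  # caller can fall back to difflib
    result = subprocess.run(
        [exe, path_old, path_new],
        capture_output=True,
        text=True,
    )
    # POSIX diff: 0 = no differences, 1 = differences found, >1 = trouble.
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout
```

Returning None when the tool is absent lets the script fall back to difflib, which keeps it usable on platforms without diff. Note that this sketch captures the whole output in memory, which may be unsuitable when the two 2 GB files differ heavily.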
