I have two big text files, nearly 2 GB each. I need something like diff f1.txt f2.txt. Is there any way to do this fast in Python? Standard difflib is too slow; I assume a faster way exists, since difflib is implemented entirely in Python.
1 Answer
How about using difflib in a way that lets your script handle big files? Don't load the files into memory; instead, iterate over the lines of both files and diff them in chunks, e.g. 100 lines at a time.
import difflib

d = difflib.Differ()

# Note: zip() stops at the end of the shorter file, so any trailing
# lines of the longer file are not diffed, and chunking assumes the
# files stay line-aligned across chunk boundaries.
with open('bigfile1') as f1, open('bigfile2') as f2:
    b1 = []
    b2 = []
    for line1, line2 in zip(f1, f2):
        b1.append(line1)
        b2.append(line2)
        if len(b1) == 100:
            # compare() expects two sequences of lines, not joined strings
            print(''.join(d.compare(b1, b2)), end='')
            b1 = []
            b2 = []
    # flush the final partial chunk
    if b1:
        print(''.join(d.compare(b1, b2)), end='')
2 Comments
Senthil Kumaran
Your other fast and portable option would be to ask users to install a diff utility for their platform and then call it from Python via a wrapper.
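A minimal sketch of that wrapper idea, assuming a diff binary is available on the PATH (it is on most Unix-like systems); the external_diff helper name here is just for illustration:

import subprocess

def external_diff(path1, path2):
    # diff exits with 0 when the files match, 1 when they differ,
    # and >1 on actual errors, so only the last case is a failure
    result = subprocess.run(['diff', path1, path2],
                            capture_output=True, text=True)
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout

print(external_diff('bigfile1', 'bigfile2'), end='')

For multi-GB files you might prefer streaming the subprocess output line by line rather than capturing it all, to keep memory use flat.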
Pithikos
Python's difflib is just slow no matter what you do. Two almost identical files of 1 MB each take me 0.5 s in the best case and a few minutes in the worst case; the diff binary (diff f1.txt f2.txt) takes 0.033 s.