I have two big text files, nearly 2 GB each. I need something like diff f1.txt f2.txt. Is there any way to do this fast in Python? Standard difflib is too slow; I assume a faster way exists, since difflib is implemented entirely in Python.
1 Answer
How about using difflib in a way that lets your script handle big files? Don't load the files into memory; instead, iterate over the lines of both files and diff them in chunks, e.g. 100 lines at a time.
import difflib

d = difflib.Differ()

# Note: zip() stops at the end of the shorter file, so any trailing
# lines of the longer file are not diffed, and chunking assumes the
# files stay line-aligned across chunk boundaries.
with open('bigfile1') as f1, open('bigfile2') as f2:
    b1 = []
    b2 = []
    for line1, line2 in zip(f1, f2):
        b1.append(line1)
        b2.append(line2)
        if len(b1) == 100:
            # compare() expects two sequences of lines, not joined strings
            print(''.join(d.compare(b1, b2)), end='')
            b1 = []
            b2 = []
    # flush the final partial chunk
    if b1:
        print(''.join(d.compare(b1, b2)), end='')
2 Comments
Senthil Kumaran
Your other fast and portable option would be to ask users to install a diff utility for their platform and then call it from Python via a wrapper.
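A minimal sketch of that wrapper idea, assuming a diff binary is available on the PATH (it is on most Unix-like systems); the external_diff helper name here is just for illustration:

import subprocess

def external_diff(path1, path2):
    # diff exits with 0 when the files match, 1 when they differ,
    # and >1 on actual errors, so only the last case is a failure
    result = subprocess.run(['diff', path1, path2],
                            capture_output=True, text=True)
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout

print(external_diff('bigfile1', 'bigfile2'), end='')

For multi-GB files you might prefer streaming the subprocess output line by line rather than capturing it all, to keep memory use flat.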
Pithikos
Python's difflib is just slow no matter what you do. Two almost identical files of 1 MB each take me 0.5 s in the best case and a few minutes in the worst case; the diff binary (diff f1.txt f2.txt) takes 0.033 s.